Apple's study proves that LLM-based AI models are flawed because they cannot reason
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.
Apple plans to introduce its own version of AI starting with iOS 18.1 - image credit Apple
The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.
The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen.
"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."
The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded.
An absence of critical thinking
A particular example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter.
The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."
The query then adds a clause that appears relevant but is actually irrelevant to the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The question then simply asked, "how many kiwis does Oliver have?"
The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model as well as Meta's Llama3-8b subtracted the five smaller kiwis from the total result.
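For reference, the arithmetic involved is trivial; a short sketch of the correct calculation, alongside the flawed one the models produced, shows how little the distractor clause should matter:

```python
# The kiwi problem from the article, worked out directly.
friday = 44
saturday = 58
sunday = 2 * friday                         # "double the number he did on Friday"

correct_total = friday + saturday + sunday  # 44 + 58 + 88 = 190

# The flawed responses subtract the five "smaller than average" kiwis,
# even though their size has no bearing on the count.
flawed_total = correct_total - 5            # 185

print(correct_total, flawed_total)          # 190 185
```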
The faulty logic echoes a 2019 study that could reliably confuse AI models by asking a question about the ages of two previous Super Bowl quarterbacks. When background information about the games they played in, and about a third person who was quarterback in another bowl game, was added, the models produced incorrect answers.
"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMS "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
Comments
https://machinelearning.apple.com/research
Answer: “Let’s break this down:
Perfect answer!
Obviously… AI does not stand for ‘Artificial Intelligence’ but for… ‘Accelerated Idiocy.’
While the large language models may give the appearance of intelligence, it's not what we normally think of as intelligence.
Suppose I programmed a computer to beg and plead not to be turned off. Every time someone got close to the power switch, it would randomly play a prerecorded message saying that it doesn't want to die, that it's afraid of being turned off, that it hurts when it is turned off. The computer would give the appearance of fear, but there would be no actual fear. After all, the computer doesn't need to know what any of the recordings are. You could keep the program the same and replace the recordings with ones begging to be turned off because the computer is miserable, with a pain in all the diodes down its left side.
No need to go far into scientific research and analysis to observe this. We asked ChatGPT-4o, the fully omniscient and omnipotent version according to OpenAI and Saint Altman, a very simple question that requires both logical reasoning and real language knowledge… Here's the answer:
"Sure! Here are seven popular expressions with the first word having 7 letters and the second word having 5 letters:
If you need more or something different, just let me know!"
For sure we will let it know…
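For what it's worth, the constraint is trivial to check programmatically. A few lines like these (a rough sketch, with made-up example phrases) are enough to see whether each suggested expression really has a seven-letter first word and a five-letter second word:

```python
# Check the stated constraint: first word has 7 letters, second word has 5.
def satisfies_constraint(expression: str) -> bool:
    words = expression.split()
    return (len(words) == 2
            and len(words[0]) == 7
            and len(words[1]) == 5)

# Hypothetical test phrases, just to show the check in action.
for phrase in ["perfect storm", "burning bridge", "crystal clear"]:
    print(phrase, "->", satisfies_constraint(phrase))
```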