Apple's study proves that LLM-based AI models are flawed because they cannot reason
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.
Apple plans to introduce its own version of AI starting with iOS 18.1 - image credit Apple
The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.
The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen.
"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."
The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded.
An absence of critical thinking
A particular example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter.
The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."
The query then adds a clause that appears relevant but is actually irrelevant to the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The question then simply asked, "how many kiwis does Oliver have?"
The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model as well as Meta's Llama3-8b subtracted the five smaller kiwis from the total result.
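For reference, the arithmetic involved is trivial; a short sketch of the correct calculation, alongside the flawed one the models produced, shows how little the distractor clause should matter:

```python
# The kiwi problem from the article, worked out directly.
friday = 44
saturday = 58
sunday = 2 * friday                         # "double the number he did on Friday"

correct_total = friday + saturday + sunday  # 44 + 58 + 88 = 190

# The flawed responses subtract the five "smaller than average" kiwis,
# even though their size has no bearing on the count.
flawed_total = correct_total - 5            # 185

print(correct_total, flawed_total)          # 190 185
```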
The faulty logic echoes a 2019 study that could reliably confuse AI models by asking a question about the ages of two previous Super Bowl quarterbacks. When background information about the games they played in, and about a third person who was quarterback in another bowl game, was added, the models produced incorrect answers.
"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMS "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
Comments
https://machinelearning.apple.com/research
Answer: “Let’s break this down:
Perfect answer!
Obviously… AI does not stand for ‘Artificial Intelligence’ but for… ‘Accelerated Idiocy.’
While the large language models may give the appearance of intelligence, it's not what we normally think of as intelligence.
Suppose I programmed a computer to beg and plead not to be turned off. Every time someone got close to the power switch, it would randomly play a prerecorded message saying that it doesn't want to die, that it's afraid of being turned off, that it hurts when it is turned off. The computer would give the appearance of fear, but there would be no actual fear. After all, the computer doesn't need to know what any of the recordings are. You could keep the program the same and replace the recordings with ones begging to be turned off because the computer is miserable, with a pain in all the diodes down its left side.
No need to go far into scientific research and analysis to observe this. We asked ChatGPT-4o, the fully omniscient and omnipotent version according to OpenAI and Saint Altman, a very simple question that requires both logical reasoning and real language knowledge… Here's the answer:
"Sure! Here are seven popular expressions with the first word having 7 letters and the second word having 5 letters:
If you need more or something different, just let me know!"
For sure we will let it know…
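For what it's worth, the constraint is trivial to check programmatically. A few lines like these (a rough sketch, with made-up example phrases) are enough to see whether each suggested expression really has a seven-letter first word and a five-letter second word:

```python
# Check the stated constraint: first word has 7 letters, second word has 5.
def satisfies_constraint(expression: str) -> bool:
    words = expression.split()
    return (len(words) == 2
            and len(words[0]) == 7
            and len(words[1]) == 5)

# Hypothetical test phrases, just to show the check in action.
for phrase in ["perfect storm", "burning bridge", "crystal clear"]:
    print(phrase, "->", satisfies_constraint(phrase))
```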