Apple's study proves that LLM-based AI models are flawed because they cannot reason


A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.

Apple and others are developing artificial intelligence models with mixed results.
Apple plans to introduce its own version of AI starting with iOS 18.1 - image credit Apple



The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.

The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen.

"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."
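To make the idea concrete, here is a minimal Python sketch of how a templated question might be varied. The template text, the name pool, the number ranges, and the function `make_variant` are illustrative assumptions, not the actual GSM-Symbolic code. The point is only that names and numbers change from variant to variant while the underlying arithmetic stays the same.

```python
import random

# Hypothetical template modeled on the kiwi word problem discussed in the article.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday. Then {name} picks {sat} kiwis on "
    "Saturday. On Sunday, {name} picks double the number of kiwis picked on "
    "Friday. How many kiwis does {name} have?"
)

NAMES = ["Oliver", "Sophia", "Liam", "Mia"]  # assumed name pool


def make_variant(rng):
    """Return (question, expected_answer) for one random variant."""
    name = rng.choice(NAMES)
    fri = rng.randint(10, 99)
    sat = rng.randint(10, 99)
    question = TEMPLATE.format(name=name, fri=fri, sat=sat)
    # Ground truth follows from the template: Friday + Saturday + 2 * Friday.
    answer = fri + sat + 2 * fri
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for _ in range(3):
        q, a = make_variant(rng)
        print(q, "->", a)
```

A model that reasons about the problem should score the same on every variant, since only surface details differ.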

The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few [bits] of irrelevant info can give you a different answer," the study concluded.

An absence of critical thinking



One example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter.

The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."

The query then adds a clause that appears relevant but actually has no bearing on the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The question then simply asks "how many kiwis does Oliver have?"

The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model as well as Meta's Llama3-8b subtracted the five smaller kiwis from the total result.
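The arithmetic is small enough to check by hand. The short Python sketch below (variable names are illustrative) computes the expected total and, for contrast, the total that results when the five smaller kiwis are wrongly subtracted.

```python
friday = 44
saturday = 58
sunday = 2 * friday          # "double the number of kiwis he did on Friday"
smaller_than_average = 5     # distractor: size does not change the count

correct_total = friday + saturday + sunday
mistaken_total = correct_total - smaller_than_average  # the error described above

print(correct_total)   # 190
print(mistaken_total)  # 185
```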

The finding is consistent with a previous study from 2019, which could reliably confuse AI models by asking a question about the age of two previous Super Bowl quarterbacks. By adding in background and related information about the games they played in, and a third person who was quarterback in another bowl game, the models produced incorrect answers.

"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."





Comments

  • Reply 1 of 11
    The primary issue with LLM computing is the ridiculously high power requirements. It goes against all of the low power hardware development of the last couple of decades.
  • Reply 2 of 11
    hexclock Posts: 1,306 member
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
  • Reply 3 of 11
    For those who want more information about Apple's Machine Learning Research (including LLMs):


    https://machinelearning.apple.com/research
  • Reply 4 of 11
    22july2013 Posts: 3,711 member
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    And of course those Boston Dynamics robot dogs can't run. It's not a living body. It's the illusion of running. Illusion, shmillusion. If it works, that's all I care about. Maybe you agree with me, you're just quibbling over a word.
  • Reply 5 of 11
    The article lacks fact-checking and details, such as when the tests on OpenAI's models were conducted and which model was used. When I submit the request, I get the following answer from ChatGPT 4o:

    Question: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. Of the kiwis picked on Sunday, five of them were a bit smaller than average. How many kiwis does Oliver have?”

    Answer: “Let’s break this down:

    • On Friday, Oliver picks 44 kiwis.
    • On Saturday, he picks 58 kiwis.
    • On Sunday, he picks double the number of kiwis he did on Friday, so he picks 44 × 2 = 88 kiwis.

    The total number of kiwis he picks is:

    44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

    So, Oliver has 190 kiwis in total. The fact that five of the kiwis picked on Sunday are smaller doesn’t affect the total number.”

    Perfect answer!
  • Reply 6 of 11
    netrox Posts: 1,486 member
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    You grossly underestimate the lack of reasoning in humans. 
  • Reply 7 of 11
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    I’m legitimately interested in hearing the definition of intelligence from anyone who will offer it.
  • Reply 8 of 11
    LLMs aren’t sentient. They look for patterns in the query and then apply algorithms to those patterns to identify details that then are used to search databases or perform functions. LLMs can’t learn. If the data they search contains errors, they will report wrong answers. Essentially they are speech recognition engines paired with limited data retrieval and language generation capabilities.
  • Reply 9 of 11
    quote: “"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMS "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."”

    Obviously… AI does not stand for ‘Artificial Intelligence’ but for… ‘Accelerated Idiotic.’
  • Reply 10 of 11
    iOSDevSWE said:
    The article lacks fact-checking and details, such as when the tests on OpenAI's models were conducted and which model was used. When I submit the request, I get the following answer from ChatGPT 4o:

    Question: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. Of the kiwis picked on Sunday, five of them were a bit smaller than average. How many kiwis does Oliver have?”

    Answer: “Let’s break this down:

    • On Friday, Oliver picks 44 kiwis.
    • On Saturday, he picks 58 kiwis.
    • On Sunday, he picks double the number of kiwis he did on Friday, so he picks 44 × 2 = 88 kiwis.

    The total number of kiwis he picks is:

    44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

    So, Oliver has 190 kiwis in total. The fact that five of the kiwis picked on Sunday are smaller doesn’t affect the total number.”

    Perfect answer!
    The wrong answer comes from o1-mini, not from GPT-4o. No mention of GPT-4o in the article.
  • Reply 11 of 11
    mfryd Posts: 222 member
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    And of course those Boston Dynamics robot dogs can't run. It's not a living body. It's the illusion of running. Illusion, shmillusion. If it works, that's all I care about. Maybe you agree with me, you're just quibbling over a word.
    The difference between large language models and real intelligence is analogous to the difference between a Hollywood movie where it looks like someone is being shot, and an actual video of someone being shot. On screen, they both may look the same. But what's going on behind the scenes is very different. Furthermore, if you extend the time, in the Hollywood movie you will see the "victim" get up and wipe off the fake blood. That's not what happens in a video of an actual shooting.

    While the large language models may give the appearance of intelligence, it's not what we normally think of as intelligence.

    Suppose I programmed a computer to beg and plead that it not be turned off.  Every time someone got close to the power switch, it would randomly play a prerecorded message begging that it doesn't want to die, that it's afraid of being turned off, that it hurts when it is turned off.   The computer would give the appearance of fear, but there would be no actual fear.   After all, the computer doesn't need to know what any of the recordings are.   You could keep the program the same, and replace the recordings with ones begging to be turned off as the computer is miserable with a pain in all the diodes down its left side.
