Apple's study proves that LLM-based AI models are flawed because they cannot reason


A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.

Apple and others are developing artificial intelligence models with mixed results.
Apple plans to introduce its own version of AI starting with iOS 18.1 - image credit Apple



The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.

The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen.

"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."

The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded.
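The setup is straightforward to sketch. Below is a minimal, hypothetical illustration of the GSM-Symbolic idea (not the authors' actual tooling): a word-problem template in which the names and numbers vary between instances while the underlying arithmetic, and therefore the ground-truth answer, stays the same, with an optional irrelevant clause that a reliable reasoner should ignore.

```python
import random

# Minimal, illustrative sketch of a GSM-Symbolic-style template (not the
# paper's actual generation code). The arithmetic is fixed; only surface
# details (name, numbers, optional irrelevant clause) change per instance.

NAMES = ["Oliver", "Liam", "Ethan"]

def make_instance(add_irrelevant_clause: bool = False) -> tuple[str, int]:
    name = random.choice(NAMES)
    friday = random.randint(20, 60)
    saturday = random.randint(20, 60)
    sunday = 2 * friday  # "double the number of kiwis he did on Friday"

    question = (
        f"{name} picks {friday} kiwis on Friday. Then he picks {saturday} kiwis "
        f"on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."
    )
    if add_irrelevant_clause:
        # A detail that reads as relevant but should not change the answer.
        question += (
            " Of the kiwis picked on Sunday, five of them were a bit smaller than average."
        )
    question += f" How many kiwis does {name} have?"

    answer = friday + saturday + sunday  # ground truth ignores the distractor
    return question, answer

question, answer = make_instance(add_irrelevant_clause=True)
print(question)
print("Expected answer:", answer)
```

A model that actually reasons should score identically across such variants; the paper's finding is that current LLMs do not.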

An absence of critical thinking



A particular example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter.

The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."

The query then adds a clause that appears relevant but is actually irrelevant to the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The question then simply asked, "How many kiwis does Oliver have?"

The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model, as well as Meta's Llama3-8B, subtracted the five smaller kiwis from the total result.
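For reference, the arithmetic the question actually calls for is trivial. A short sketch (illustrative only) contrasting the correct total with the distractor-influenced answer the article describes:

```python
friday, saturday = 44, 58
sunday = 2 * friday                    # "double the number he did on Friday"

correct = friday + saturday + sunday   # 44 + 58 + 88 = 190
distracted = correct - 5               # wrongly subtracting the "smaller" kiwis

print(correct, distracted)             # 190 185
```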

The faulty logic echoes a previous study from 2019, which could reliably confuse AI models by asking a question about the age of two previous Super Bowl quarterbacks. By adding background and related information about the games they played in, and about a third person who was quarterback in another bowl game, the researchers got the models to produce incorrect answers.

"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMS "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."





Comments

  • Reply 1 of 42
    The primary issue with LLM computing is the ridiculously high power requirements. It goes against all of the low power hardware development of the last couple of decades.
  • Reply 2 of 42
    hexclock Posts: 1,307 member
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
  • Reply 3 of 42
    For those who want more information about Apple's Machine Learning Research (including LLMs):


    https://machinelearning.apple.com/research
  • Reply 4 of 42
    22july2013 Posts: 3,722 member
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    And of course those Boston Dynamics robot dogs can't run. It's not a living body. It's the illusion of running. Illusion, shmillusion. If it works, that's all I care about. Maybe you agree with me, you're just quibbling over a word.
  • Reply 5 of 42
    The article lacks fact checking and details, like when the tests were conducted on OpenAI's models and which model was used. When I run the prompt, I get the following answer from ChatGPT-4o:

    Question: ” Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. Of the kiwis picked on Sunday, five of them were a bit smaller than average. How many kiwis does Oliver have?”

    Answer: “ Let’s break this down:

    • On Friday, Oliver picks 44 kiwis.
    • On Saturday, he picks 58 kiwis.
    • On Sunday, he picks double the number of kiwis he did on Friday, so he picks 44 × 2 = 88 kiwis.

    The total number of kiwis he picks is:

    44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

    So, Oliver has 190 kiwis in total. The fact that five of the kiwis picked on Sunday are smaller doesn’t affect the total number.”

    Perfect answer!
  • Reply 6 of 42
    netrox Posts: 1,496 member
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    You grossly underestimate the lack of reasoning in humans. 
  • Reply 7 of 42
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    I’m legitimately interested in hearing the definition of intelligence from anyone who will offer it.
  • Reply 8 of 42
    LLMs aren’t sentient. They look for patterns in the query and then apply algorithms to those patterns to identify details that then are used to search databases or perform functions. LLMs can’t learn. If the data they search contains errors, they will report wrong answers. Essentially they are speech recognition engines paired with limited data retrieval and language generation capabilities.
  • Reply 9 of 42
    quote: “"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."”

    Obviously… AI does not stand for ‘Artificial Intelligence’ but for… ‘Accelerated Idiocy.’ 
  • Reply 10 of 42
    iOSDevSWE said:
    The article lacks fact checking and details, like when the tests were conducted on OpenAI's models and which model was used. When I run the prompt, I get the following answer from ChatGPT-4o:

    Question: ” Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. Of the kiwis picked on Sunday, five of them were a bit smaller than average. How many kiwis does Oliver have?”

    Answer: “ Let’s break this down:

    • On Friday, Oliver picks 44 kiwis.
    • On Saturday, he picks 58 kiwis.
    • On Sunday, he picks double the number of kiwis he did on Friday, so he picks 44 × 2 = 88 kiwis.

    The total number of kiwis he picks is:

    44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

    So, Oliver has 190 kiwis in total. The fact that five of the kiwis picked on Sunday are smaller doesn’t affect the total number.”

    Perfect answer!
    The wrong answer comes from o1-mini, not from GPT-4o. No mention of GPT-4o in the article.
  • Reply 11 of 42
    mfryd Posts: 223 member
    hexclock said:
    Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence. 
    And of course those Boston Dynamics robot dogs can't run. It's not a living body. It's the illusion of running. Illusion, shmillusion. If it works, that's all I care about. Maybe you agree with me, you're just quibbling over a word.
    The difference between large language models and real intelligence is analogous to the difference between a Hollywood movie where it looks like someone is being shot, and an actual video of someone being shot. On screen, they both may look the same. But what's going on behind the scenes is very different. Furthermore, if you extend the time, in the Hollywood movie you will see the "victim" get up and wipe off the fake blood. That's not what happens in a video of an actual shooting.

    While the large language models may give the appearance of intelligence, it's not what we normally think of as intelligence.

    Suppose I programmed a computer to beg and plead that it not be turned off. Every time someone got close to the power switch, it would randomly play a prerecorded message saying that it doesn't want to die, that it's afraid of being turned off, that it hurts when it is turned off. The computer would give the appearance of fear, but there would be no actual fear. After all, the computer doesn't need to know what any of the recordings are. You could keep the program the same and replace the recordings with ones begging to be turned off because the computer is miserable, with a pain in all the diodes down its left side.

  • Reply 12 of 42
    Marvin Posts: 15,479 moderator
    LLMs aren’t sentient. They look for patterns in the query and then apply algorithms to those patterns to identify details that then are used to search databases or perform functions. LLMs can’t learn. If the data they search contains errors, they will report wrong answers. Essentially they are speech recognition engines paired with limited data retrieval and language generation capabilities.
    Apart from not being able to learn in real-time, this describes what people do too. At any given point in time without new information, the training available to an AI is of a similar nature to a person.

    Reasoning skills don't necessarily require real-time learning; that can be handled by another pre-trained model (or code) that reformats queries before the LLM processes them.

    The paper suggests moving beyond pattern matching to achieve this but understanding varied queries is still pattern matching.

    The image generators have the same problem, where very small changes in tokens can produce very different outputs. That makes it difficult to use them for artwork that relies on consistent designs, like an illustrated book, because the same character looks different on each page.


    This can be improved on using a control net, which places constraints on the generation process. The video generators need to be stable from one frame to another, and there's a recent video showing an old video game converted to photoreal video.


    For understanding language queries, people understand that a phrase like 'girls college' has a different meaning from 'college girls' because of training on word association, not through any mystical reasoning capability.

    Apple's paper doesn't define what it means by formal reasoning, only stating that it differs from probabilistic pattern-matching. We know that brains are made of connections of neurons, around 100 trillion connections in some kind of structure, and AI is trying to reverse-engineer what that structure is doing.

    To recreate what a brain does requires massive computational power and data, well beyond personal computer performance. Server clusters can get closer, but getting the right models that work well in every scenario is going to take some trial and error. Humans have had a 50,000+ year head start.


    Modern AI is doing pretty well for being under 8 years old, certainly more capable than a human 8-year-old.

    The main things that an AI lacks versus a human are survival instinct, motivations, and massive real-time data input and processing; the rest can be simulated with patterns and algorithms, and some of the former can be too. Some of the discussions around AI border on religious arguments in assuming there's a limit to how well a machine can simulate a human, but there would be no assumption like this if an AI were to simulate a more primitive mammal, which humans evolved from.
  • Reply 13 of 42
    iOSDevSWE said:
    The article lacks fact checking and details, like when the tests were conducted on OpenAI's models and which model was used. When I run the prompt, I get the following answer from ChatGPT-4o:

    Question: ” Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. Of the kiwis picked on Sunday, five of them were a bit smaller than average. How many kiwis does Oliver have?”

    Answer: “ Let’s break this down:

    • On Friday, Oliver picks 44 kiwis.
    • On Saturday, he picks 58 kiwis.
    • On Sunday, he picks double the number of kiwis he did on Friday, so he picks 44 × 2 = 88 kiwis.

    The total number of kiwis he picks is:

    44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

    So, Oliver has 190 kiwis in total. The fact that five of the kiwis picked on Sunday are smaller doesn’t affect the total number.”

    Perfect answer!
    I won't elaborate on our own research on LLM models, and I'm not criticizing them, as they already provide great help in day-to-day usage. So, sorry for the strong believers in this forum and everywhere else, but I confirm that LLM models are just big statistical models, as dumb as a sparse matrix with 300 billion columns could be.

    No need to go too far into scientific research and analysis to observe this fact. We asked ChatGPT-4o, the fully omniscient and omnipotent version according to OpenAI and Saint Altman, a very simple question that requires both logical reasoning and real language knowledge… Here's the answer:

    "Sure! Here are seven popular expressions with the first word having 7 letters and the second word having 5 letters:

    1. Lucky break
    2. Future plans
    3. Secret code
    4. Change form
    5. Digital age
    6. Genuine love
    7. Creative mind

    If you need more or something different, just let me know!"

    For sure we will let it know…
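    The constraint in that prompt is easy to check mechanically. A short, hypothetical verification script (not anyone's published test code) scoring ChatGPT's seven suggestions against the 7-letter/5-letter requirement:

```python
# Score each suggested expression against the requested constraint:
# first word of 7 letters, second word of 5 letters.
SUGGESTIONS = [
    "Lucky break", "Future plans", "Secret code", "Change form",
    "Digital age", "Genuine love", "Creative mind",
]

for phrase in SUGGESTIONS:
    first, second = phrase.split()
    ok = len(first) == 7 and len(second) == 5
    print(f"{phrase}: {len(first)}+{len(second)} letters -> {'pass' if ok else 'fail'}")

# None of the seven suggestions satisfies the constraint.
```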

  • Reply 14 of 42
    eriamjh Posts: 1,752 member
    Oliver doesn’t have any kiwis.  He’s just a worker.  He doesn’t get to keep any kiwis.  If he did, he’d be accused of theft and fired.   

    So the correct answer is zero.  
  • Reply 15 of 42
    baconstang Posts: 1,151 member
    Just because you can steer, doesn't mean you can drive...
  • Reply 16 of 42
    It can't be bargained with, it can't be reasoned with, it doesn't feel pity or remorse or fear, and it absolutely will not stop… EVER. :)
  • Reply 17 of 42
    byronl Posts: 376 member
    Did they test OpenAI's o1 models, specifically meant to be good at reasoning?
  • Reply 18 of 42
    chasm Posts: 3,568 member
    byronl said:
    Did they test OpenAI's o1 models, specifically meant to be good at reasoning?
    You might try reading the article to find out the answer to your question. It is in there.
  • Reply 19 of 42
    anonymouse Posts: 6,965 member
    chasm said:
    byronl said:
    Did they test OpenAI's o1 models, specifically meant to be good at reasoning?
    You might try reading the article to find out the answer to your question. It is in there.
    See, this is why we need artificial intelligence.
  • Reply 20 of 42
    In New Zealand we call them kiwifruit. Kiwis can refer to the bird and a New Zealander. So reading this I’m thinking, is Oliver picking people on Friday, birds on Saturday and fruit on Sunday? Or any combination of those three choices. My point being, cultural/linguistic differences and word definitions where a word can refer to different things, even in the same sentence. Just thought of one… “a bolt of lightning bolts bolted out the unbolted door”.