
MacRumors

macrumors bot
Original poster
Apr 12, 2001


Apple's AI research team has uncovered significant weaknesses in the reasoning abilities of large language models, according to a newly published study.


The study, published on arXiv, outlines Apple's evaluation of a range of leading language models, including those from OpenAI, Meta, and other prominent developers, to determine how well these models handle mathematical reasoning tasks. The findings reveal that even slight changes in the phrasing of questions can cause major discrepancies in model performance, undermining the models' reliability in scenarios that require logical consistency.

Apple draws attention to a persistent problem in language models: their reliance on pattern matching rather than genuine logical reasoning. In several tests, the researchers demonstrated that adding irrelevant information to a question—details that should not affect the mathematical outcome—can lead to vastly different answers from the models.

One example given in the paper involves a simple math problem asking how many kiwis a person collected over several days. When irrelevant details about the size of some kiwis were introduced, models such as OpenAI's o1 and Meta's Llama incorrectly adjusted the final total, despite the extra information having no bearing on the solution.
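To make the arithmetic concrete, here is a minimal sketch (with illustrative figures in the spirit of the paper's kiwi example, not necessarily its exact numbers) showing why a detail about smaller kiwis should have no bearing on the total:

```python
# Illustrative figures only: the distractor about kiwi size is irrelevant.
kiwis_friday = 44
kiwis_saturday = 58
kiwis_sunday = 2 * kiwis_friday      # "double the number picked on Friday"
smaller_than_average = 5             # irrelevant detail about size

correct_total = kiwis_friday + kiwis_saturday + kiwis_sunday
print(correct_total)                 # 190 -- the size detail is ignored

# The failure mode described in the study: subtracting the distractor anyway.
wrong_total = correct_total - smaller_than_average
print(wrong_total)                   # 185
```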

We found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%.

This fragility in reasoning prompted the researchers to conclude that the models do not use real logic to solve problems but instead rely on sophisticated pattern recognition learned during training. They found that "simply changing names can alter results," a potentially troubling sign for the future of AI applications that require consistent, accurate reasoning in real-world contexts.
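To illustrate what a name-and-number perturbation looks like, here is a minimal sketch (not the study's actual benchmark code) of a templated word problem whose surface details vary while the underlying logic stays fixed:

```python
import random

# Hypothetical template: the names and numbers change, the reasoning does not.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have in total?")
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]

def make_variant(rng):
    a, b = rng.randint(10, 60), rng.randint(10, 60)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b           # the ground-truth answer is always a + b

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)

# A model that genuinely reasons should score the same on every variant;
# the study reports accuracy drops when only these surface details change.
```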

According to the study, all models tested, from smaller open-source versions like Llama to proprietary models like OpenAI's GPT-4o, showed significant performance degradation when faced with seemingly inconsequential variations in the input data. Apple suggests that AI might need to combine neural networks with traditional, symbol-based reasoning, an approach known as neurosymbolic AI, to achieve more accurate decision-making and problem-solving abilities.
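As a loose, hypothetical sketch of the neurosymbolic idea (not Apple's actual proposal), one can imagine a symbolic layer that re-checks whatever arithmetic the neural model claims to have performed:

```python
import ast
import operator as op

# Toy symbolic checker: safely evaluate a simple arithmetic expression.
SAFE_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def evaluate(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
            return SAFE_OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

# Hypothetical model output: the expression it set up, plus the answer it gave.
model_expression = "44 + 58 + 2 * 44"   # derived from the word problem
model_answer = 185                      # the pattern-matched (wrong) result

checked = evaluate(model_expression)    # 190
print("symbolic check:", checked)
print("model agrees with symbolic check:", checked == model_answer)  # False
```

The point of the sketch is only that a deterministic verifier can catch the kind of inconsistency the study describes; how the neural and symbolic parts would actually be combined is an open research question.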

Article Link: Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities
 
This shows quite clearly that LLMs aren't "intelligent" in any reasonable sense of the word; they're just highly advanced at (speech/writing) pattern recognition.

Basically electronic parrots.

They can be highly useful, though. I've used ChatGPT (4o with canvas and o1-preview) quite a lot for tweaking code examples to show in class, for instance.
 
Claude solved this on the first try, so this news is a big nothingburger. Funny that the study says "the majority of models fail to ignore these statements". So there were models that worked fine, but they only cherry-picked the worst ones? Smells of bias.

Actually ChatGPT 4o also solved it on first go, so what the hell? I actually ran this multiple times with both models and it never came out wrong. Did they run the test 10000 times until the AI tripped?
 
I do find it odd that there's a suggestion with AI that it is literally aware and 'thinking'. I am not particularly educated on these things, but that seems impossible to me. Maybe someone here can explain it better than all these companies have tried to, but my impression is that we've just reached a point where processors are fast enough to access all this collected data very efficiently and compare/collate it.
 
I'm shocked this is shocking to anyone who has spent more than 10 seconds using any AI tools, yes, even the latest, most advanced models.

AI is amazing. But it's not "reasoning." It does not create or find "meaning." The language models are, yes, modeling. They're damn good, but they're all about recognizing patterns and connections, and sometimes (often) making connections that aren't there. These appear as "logical" errors, but they aren't logic at all; they're just hallucinations, such as two things having the same name confusing the analysis.
 
This isn't new information. The LLMs getting all the attention are using extremely large volumes of data to make highly advanced, educated guesses. This is also why these models "hallucinate": the responses just happen to be correct much of the time, and it's also why they sound correct even when they're wrong.
 
I'm shocked this is shocking to anyone who has spent more than 10 seconds using any AI tools, yes, even the latest, most advanced models.
Actually, it's not a surprise at all. Most people who use AI don't look at the responses. They just blindly accept what they're given, believing that it's accurate. They try it three or four times, they like what comes out, and after that they just take the response without critically looking at it to determine if it makes any sense.
 
Apple draws attention to a persistent problem in language models: their reliance on pattern matching rather than genuine logical reasoning.
Some, but not all, tech writers have correctly pointed out that today's "AI" capabilities aren't actually AI in the sense of artificial logic and reasoning. It's simply the next generation of machine learning branded under a new buzzword. And ironically, despite all of the moaning about Apple being behind on AI (which it is from an LLM perspective), Apple has a solid foundation of machine learning hardware and software capabilities to build on.
 
If this surprises you, you've been lied to. Next, figure out why they wanted you to think "AI" was actually thinking in a way qualitatively similar to humans. Was it just for money? Was it to scare you and make you easier to control?
Much of it is just popular hype from people who don't know enough to know the difference. Think of the NY Times article that sort of kicked it all off in the popular media a couple of years ago. The writer seemed convinced that the AI was obsessing over him and actually asking him to leave his wife. The actual transcript, for anyone who's seen this stuff back through the decades, showed the AI program bouncing off programmed parameters and being pushed by the writer into shallow territory where it lacked sufficient data to create logical interactions. The writer, and most people reading it, however, thought the AI was being borderline sentient.

The simpler, Occam's-razor explanation for why AI businesses have rolled with that perception, or at least haven't tried much to refute it, is that it provides cover for the LLM "learning" process that steals copyrighted intellectual property and then regurgitates it in whole or in collage form. The sheen of possible sentience clouds the theft ("people also learn by consuming the work of others") as well as the plagiarism ("people are influenced by the work of others, so what then constitutes originality?"). When it's made clear that LLM AI is merely hoovering, blending, and regurgitating with no involvement of any sort of reasoning process, it becomes clear that the theft of intellectual property is just that: theft of intellectual property.
 
Why has no one else reported this? It took the “newcomer” Apple to figure it out and to tell the truth?
You seem to misunderstand the difference between assuming something based on noticing it, a study, and reporting.

This is not the first time someone noticed something and wrote about it. ChatGPT was outright lying to me so many times it's bonkers. It's funny to loosely bounce ideas around, but even when I try to streamline my text for LinkedIn it will change it in nonsensical ways. It's not as great as the ex-crypto and ex-NFT course sellers tell you it is.
 