> Did they run the test 10000 times until the AI tripped?

Given that millions and millions of people will run AI queries, it seems fair to test it repeatedly to see if it can give a proper answer each time.
> Did they use the o1-preview model from OpenAI? (o1-preview is allowed more compute time so that it can produce better reasoning in its output.)

The fourth paragraph of the article says they used o1.
> Claude solved this on first try so this news is a big nothing burger. Funny that the study says "the majority of models fail to ignore these statements". So there were models that worked fine but they only cherry-picked the worst ones? Smells of bias.

Yeah, this is exactly how significance testing works. I haven't read the paper, but it is assumed that they would have tested a large number of queries, a large number of times, since that's how you approximate the true result (law of large numbers). A published, peer-reviewed study should also find statistical significance, meaning that it is extremely unlikely that chance alone could explain the result. I'd imagine they also tested different contexts surrounding the question itself (though maybe not), since a model that gets it right in isolation but fails when there is additional (tangentially related or unrelated) content in the context window is clearly not using reason, which would remain consistent regardless of the surrounding context.
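To put a number on the "chance alone" point, here is a minimal sketch of the kind of check the poster is describing. The counts are hypothetical (not figures from the paper), the function name is mine, and the test is a plain two-proportion z-test comparing accuracy with and without an irrelevant statement in the prompt.

```python
# Minimal sketch: could the accuracy drop between "plain" questions and questions
# padded with an irrelevant sentence be explained by chance? (normal approximation)
import math

def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both conditions share one true accuracy."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of a standard normal
    return z, p_value

# Hypothetical tallies (not from the paper): 500 runs per condition,
# 460 correct on the plain question, 380 correct with the distractor added.
z, p = two_proportion_z_test(460, 500, 380, 500)
print(f"accuracy {460/500:.1%} vs {380/500:.1%}, z = {z:.2f}, p = {p:.2e}")
```

With counts anywhere near these, the p-value is vanishingly small, which is the sense in which a handful of successful one-off runs cannot rebut a measured drop in accuracy.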
Actually ChatGPT 4o also solved it on first go, so what the hell? I actually ran this multiple times with both models and it never came out wrong. Did they run the test 10000 times until the AI tripped?
[Attachment 2437141]
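For anyone who wants to go beyond "I ran it a few times", here is a rough repeat-run harness under stated assumptions: the model name, trial count, and answer check are placeholders, and the kiwi question only paraphrases the example that circulated from the study. The only real API usage is the standard OpenAI Python client (openai>=1.0, with OPENAI_API_KEY set).

```python
from openai import OpenAI  # standard OpenAI Python client, v1.x

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLAIN = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. On Sunday he "
         "picks double the number he picked on Friday. How many kiwis does Oliver have?")
# Same question with one irrelevant clause added, in the spirit of the study.
DISTRACTOR = PLAIN.replace(
    "How many kiwis",
    "Five of Sunday's kiwis were a bit smaller than average. How many kiwis")
EXPECTED = "190"  # 44 + 58 + 2*44

def tally(prompt: str, trials: int = 20) -> int:
    """Run the same prompt `trials` times and count replies containing the expected number."""
    correct = 0
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed model name; swap in whatever you are testing
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        correct += EXPECTED in text  # naive check: the right number appears somewhere
    return correct

for label, prompt in [("plain", PLAIN), ("with distractor", DISTRACTOR)]:
    print(f"{label}: {tally(prompt)}/20 correct")
```

Tallying a couple of dozen runs per condition gives counts that the significance test sketched above can actually be applied to, instead of a single screenshot.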
> If this surprises you, you've been lied to. Next, figure out why they wanted you to think "AI" was actually thinking in a way qualitatively similar to humans. Was it just for money? Was it to scare you and make you easier to control?

AI has mainly been the always-shady tech industry trying to cash in on a new big thing. It is not like it is actually hard for large software makers to convince media and politicians that something is huge and deserves massive investment. Add in the threat that other countries (China, etc.) may beat the US to the new technology and you have the makings of a true bonanza.
> LLMs have no reasoning ability. They're literally just next-word-guessing algorithms fed on the stolen works of others and, worse, public forum posts. There's no intelligence at work, at all.

It is the more modern version of early (and sometimes modern) Wikipedia pages where people "source" something that has zero factual basis.
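As a toy illustration of what "next word guessing" means mechanically, here is a tiny bigram sampler; the training string and function names are invented for the example. A real LLM replaces the lookup table with a neural network over subword tokens and far more context, but the generation loop (guess one next token, append it, repeat) has the same shape.

```python
import random
from collections import defaultdict

# Tiny "training corpus"; a real model is trained on trillions of tokens.
training_text = (
    "the model predicts the next word the model samples the next word "
    "and appends it to the prompt then predicts the next word again"
).split()

follows: dict[str, list[str]] = defaultdict(list)
for prev, nxt in zip(training_text, training_text[1:]):
    follows[prev].append(nxt)  # remember every continuation seen after each word

def generate(start: str, length: int = 12, seed: int = 0) -> str:
    """Repeatedly guess a next word from the observed continuations and append it."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        out.append(random.choice(options))
    return " ".join(out)

print(generate("the"))
```

The point of the sketch is only the shape of the loop, not the quality of the guesses.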
> Anyone that uses an LLM on occasion should know this from experience. Nice to see some data though.

That's the differentiator. Anyone truly following this industry knew that intuitively (or from experience), but Apple actually assimilated the data. So let's not bash them too hard for this.
> If this surprises you, you've been lied to. Next, figure out why they wanted you to think "AI" was actually thinking in a way qualitatively similar to humans. Was it just for money? Was it to scare you and make you easier to control?

It seems you might be the one with the skewed perspective. Do you use it very often?
It's funny to me that we act as though, if AI isn't "sentient," it must be complete BS. Why not just recognize what it is good at and stop with the unnecessary hype?