
MacRumors

macrumors bot
Original poster
Apr 12, 2001


In what may not come as much of a surprise, a new test of Siri's knowledge of Super Bowl history has revealed significant accuracy issues with Apple's virtual assistant, suggesting Apple still has some way to go in overcoming challenges with Siri's ability to provide reliable information.

[Image: Should-Apple-Kill-Siri-Feature.jpg]

In a methodical experiment, One Foot Tsunami's Paul Kafasis asked Siri who won each Super Bowl from I through LX and documented its responses. The results were strikingly poor, with Siri correctly identifying winners only 34% of the time – just 20 correct answers out of 58 played Super Bowls.

Perhaps most notably, Siri repeatedly and incorrectly credited the Philadelphia Eagles with 33 Super Bowl victories, despite the team having won only one championship in their history. The virtual assistant's responses ranged from providing information about wrong Super Bowls to offering completely unrelated football facts.

While Siri did manage a few streaks of accurate answers, including three consecutive correct responses for Super Bowls V through VII, it also had a remarkable string of 15 consecutive incorrect answers spanning Super Bowls XVII through XXXII.

In one telling instance, when asked about Super Bowl XVI, Siri offered to defer to ChatGPT, which then provided the correct answer. The contrast highlighted the limitations of Siri's own knowledge base compared to more advanced AI systems.

The test was conducted on iOS 18.2.1 with Apple Intelligence enabled, and similar results were found on both the upcoming iOS 18.3 beta and macOS 14.7.2, suggesting the issue extends across Apple's platforms. Kafasis generated a spreadsheet of the results in both Excel and PDF formats, which you can read here.

Separately, inspired by Kafasis' test, Daring Fireball's John Gruber tried some of his own sports queries with Siri and compared its responses to ChatGPT, Kagi, DuckDuckGo, and Google, all of which succeeded where Siri failed.

Perhaps worse for Apple, Gruber found that old Siri (i.e. before Apple Intelligence) did a better job at answering a question by declining to answer it, instead providing a list of web links. The first web result provided an accurate, if only partial, answer to the question, whereas new Siri, powered by Apple Intelligence, fared much worse. Gruber explains:
New Siri — powered by Apple Intelligence™ with ChatGPT integration enabled — gets the answer completely but plausibly wrong, which is the worst way to get it wrong. It's also inconsistently wrong — I tried the same question four times, and got a different answer, all of them wrong, each time. It's a complete failure.
"It's just incredible how stupid Siri is about a subject matter of such popularity," commented Gruber. "If you had guessed that Siri could get half the Super Bowls right, you lost, and it wasn't even that close."

Of course, this isn't the first time Siri has received heavy flak for its all-round performance, but Gruber's criticism about "plausibly wrong" answers to general knowledge questions ties back to the modern problem of hallucinating AI chatbots that spout misleading or flat-out wrong responses with complete confidence.

Apple is developing a much smarter version of Siri that utilizes advanced large language models, which should allow the personal assistant to better compete with chatbots like ChatGPT. A chatbot version of Siri would likely be able to hold ongoing conversations and provide the sort of help and insight that ChatGPT or Claude offer, but how well the integration will perform may be a concern, going by Siri's abysmal track record.

Apple is expected to announce LLM Siri as soon as WWDC 2025, but Apple won't launch it until several months after it's unveiled. That means LLM Siri would arrive in an update to iOS 19, with Apple planning for a spring 2026 launch.

Article Link: Siri Gives Eagles 33 False Super Bowl Wins in Basic Knowledge Test
 
Siri has been and always will be useless for anything other than simple things like setting timers. Apple has not written good software in years. The fact that their platforms are still mostly usable speaks to how far ahead they were a decade+ ago.
 
Siri has been and always will be useless for anything other than simple things like setting timers. Apple has not written good software in years. The fact that their platforms are still mostly usable speaks to how far ahead they were a decade+ ago.

This isn't about Apple as such. The entire idea behind this is fundamentally flawed, and they are being dragged into the hype, praying that it'll eventually work, which it won't. Every company in the LLM space is doing the same thing. It's an arms race based on swimming in excrement with the promise of a cake at the end (the cake is a lie), and the end game is drowning.

Your second point I agree with. Their stuff actually works and has done for a long time. It's done. Finished. It's not that they haven't written good software in years; it's that they wrote software so good it's finished (bar the odd bug here and there). The next step, of course, is to change it because they have nothing else to do, and ruin it (Photos app, anyone?)
 
This is interesting but Siri has always (before the "new Siri") been a basic personal assistant and not a source of general knowledge. The steps towards being more of a generalist tool are just baby steps.

These results are bad, but Siri also really wasn't designed to do things like this. It's like asking your executive assistant to perform heart surgery. Without the proper training, you're just asking for failure. The new Siri is just the old Siri with a little GPT thrown in.

Siri clearly needs to improve. This initial GPT integration is on its way to being trained to answer questions like this, but this shows that these beta features have a long way to go.

Edit: Check out Kramerjn's comment about the results being affected by how Siri is asked: https://forums.macrumors.com/thread...in-basic-knowledge-test.2448022/post-33695052

I also did some quick tests and verified that how you ask matters: https://forums.macrumors.com/thread...in-basic-knowledge-test.2448022/post-33695203

You can get the correct answers if you ask differently.
 
Last edited:
wut?

you can install and run countless non-Apple AI assistants on iPhones, iPads, and Macs
For now. Apple's AI is not up to speed. But when it is, these others are competition, and that violates Apple App Store rules. Plus, these others do not protect the children the way Apple's AI supposedly will.
 
This isn't a Maps level fiasco, but it's not too far off.

Remember when Apple's selling point was "they're not first, but when they do something, they do it right"? Those were the days.

We've got a confluence of factors here:
  • An industry-wide fixation on what is in many ways not very good technology (and an insistence on shoehorning it in everywhere).
  • Apple having a particular weakness in this particular area, dating back to well before the machine learning age and showing no signs of improvement.
  • FOMO on Apple's part - fear of Android/Samsung eating their lunch if they can't say they also do this stuff.
So we've got a technology where even the best implementations are pretty bad, and Apple's implementation is worse.
 
Not surprising. Apple really were caught with their pants down on LLMs when ChatGPT launched. It sounds like Google at least had something working in their labs, which is why they could rush out Bard (later Gemini) a few months later.

If Apple launches something this terrible in the next 12 months, it is going to be usurped quickly by ChatGPT, Copilot, and Gemini, with iPhone users preferring those services over Apple Intelligence, and it will take years to convince people Siri is not terrible, even if it technically catches up.
 
This isn't a Maps level fiasco, but it's not too far off.

Remember when Apple's selling point was "they're not first, but when they do something, they do it right"? Those were the days.

We've got a confluence of factors here:
  • An industry-wide fixation on what is in many ways not very good technology (and an insistence on shoehorning it in everywhere).
  • Apple having a particular weakness in this particular area, dating back to well before the machine learning age and showing no signs of improvement.
  • FOMO on Apple's part - fear of Android/Samsung eating their lunch if they can't say they also do this stuff.
So we've got a technology where even the best implementations are pretty bad, and Apple's implementation is worse.
Exactly this. I know it's a trope on here, but this is exactly when a Steve Jobs-like personality is really needed. Famously, he wasn't interested in pandering to Wall Street, insisting that Apple would follow its own path and wouldn't pay dividends or buy back shares, relying instead on a constant pipeline of amazing products to grow the company and stock price.

Under Tim Cook, Apple became much more led by its investors (look at the value of dividends and buybacks over the last 10 years). This whole AI/LLM rush, not just to catch up but to publicly announce their plans almost a year in advance (context-aware Siri was announced last June but won't be available until iOS 18.4 around April/May), screams that they are messaging to investors that they are following the industry trend, rather than taking that trend and creating a new one with it as they used to do.
 
This isn't about Apple as such. The entire idea behind this is fundamentally flawed, and they are being dragged into the hype, praying that it'll eventually work, which it won't. Every company in the LLM space is doing the same thing. It's an arms race based on swimming in excrement with the promise of a cake at the end (the cake is a lie), and the end game is drowning.
As the original post states, ChatGPT, Kagi, DuckDuckGo, and Google are all capable of giving correct answers; for some reason only Siri isn't. Siri with ChatGPT support is somehow worse than ChatGPT on its own.
 
This is interesting but Siri has always been a basic personal assistant and not a source of general knowledge. The steps towards being more of a generalist tool are just baby steps.

These results are bad but Siri also wasn’t designed to do things like this. It’s like asking your executive assistant to perform heart surgery. Without the proper training, you’re just asking for failure.

Siri clearly needs to improve but it’s also important to recognize that this test is asking Siri to do things it wasn’t trained to do.
iOS 6 > WWDC 2012

"The first thing that Siri has learnt in the last 8 months, is all about sports"

 


The only times Siri shines is when you compare it to Bixby.
 
Siri has been and always will be useless for anything other than simple things like setting timers. Apple has not written good software in years. The fact that their platforms are still mostly usable speaks to how far ahead they were a decade+ ago.
Well timers, and half the time Siri also manages to unlock my front door. Other than that, it’s nearly useless.
 
This is an English-language test, and other languages are probably even worse. I haven't used Siri in many years because it's so bad. My wife uses it to set timers, but nothing else. If Siri actually worked well, Apple might even sell more Apple Watches and HomePods.

The Speakable Items feature of classic Mac OS and early OS X was more capable than this abomination.
 
For now. Apple's AI is not up to speed. But when they are, these others are competition and that violates Apple App Store rules. Plus these others do not protect the children the way Apple's AI supposedly will.

You predict that Apple will ban these other AI apps once its own AI is up to speed?

Do you have any historical evidence of Apple doing this in the past?
 
iOS 6 > WWDC 2012

"The first thing that Siri has learnt in the last 8 months, is all about sports"

I remember that. It's important to recognize that all the sports things were recent/current. Apple's updates halfheartedly tried to add more features, but Siri was never meant to be a do-it-all tool. The sports additions were not meant to answer historical, encyclopedic, or sports-almanac questions like who won a particular title game. Siri was and is just a basic personal assistant. Do a test: ask the old Siri who won each Super Bowl from I through LX. What does it tell you?

Gruber’s comment that the “old” Siri not answering but offering to search the web shows that such questions are not within the realm of what Siri deals with. That was in the article: "Gruber found that old Siri (i.e. before Apple Intelligence) did a better job at answering a question by declining to answer it, instead providing a list of web links." Siri was/is a personal assistant and not a generalist AI tool (yet).

Apple’s updates are turning it into a generalist tool but it’s obviously not anywhere close to that goal. It’s clearly confabulating the answers like GPT 3.5 or many other LLMs do. The newest LLMs still have that problem but are getting better. As newer, more powerful LLM features are more integrated with Siri it should get better.
 
Last edited: