> Did they run the test 10000 times until the AI tripped?

Given that millions and millions of people will run AI queries, it seems fair to test it repeatedly to see if it can give a proper answer each time.
> Did they use the o1-preview model from OpenAI? (o1-preview is allowed more compute time so that it can produce better reasoning in its output.)

The fourth paragraph of the article says they used o1.
> Claude solved this on first try so this news is a big nothing burger. Funny that the study says "the majority of models fail to ignore these statements". So there were models that worked fine but they only cherry-picked the worst ones? Smells of bias.

Yeah, this is exactly how significance testing works. I haven't read the paper, but it is assumed that they would have tested a large number of queries, a large number of times, since that's how you approximate the true result (law of large numbers). A published, peer-reviewed study should also find statistical significance, meaning that it is extremely unlikely that chance alone could explain the result. I'd imagine they also tested different contexts surrounding the question itself (though maybe not), since a model that gets it right in isolation but fails when there is additional (tangentially related or unrelated) content in the context window is clearly not using reason, which would remain consistent regardless of the surrounding context.
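To put a number on the "chance alone" point, here is a minimal sketch of the kind of check the poster is describing. The counts are hypothetical (not figures from the paper), the function name is mine, and the test is a plain two-proportion z-test comparing accuracy with and without an irrelevant statement in the prompt.

```python
# Minimal sketch: could the accuracy drop between "plain" questions and questions
# padded with an irrelevant sentence be explained by chance? (normal approximation)
import math

def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both conditions share one true accuracy."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of a standard normal
    return z, p_value

# Hypothetical tallies (not from the paper): 500 runs per condition,
# 460 correct on the plain question, 380 correct with the distractor added.
z, p = two_proportion_z_test(460, 500, 380, 500)
print(f"accuracy {460/500:.1%} vs {380/500:.1%}, z = {z:.2f}, p = {p:.2e}")
```

With counts anywhere near these, the p-value is vanishingly small, which is the sense in which a handful of successful one-off runs cannot rebut a measured drop in accuracy.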
Actually ChatGPT 4o also solved it on first go, so what the hell? I actually ran this multiple times with both models and it never came out wrong. Did they run the test 10000 times until the AI tripped?
[Attachment 2437141]
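For anyone who wants to go beyond "I ran it a few times", here is a rough repeat-run harness under stated assumptions: the model name, trial count, and answer check are placeholders, and the kiwi question only paraphrases the example that circulated from the study. The only real API usage is the standard OpenAI Python client (openai>=1.0, with OPENAI_API_KEY set).

```python
from openai import OpenAI  # standard OpenAI Python client, v1.x

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLAIN = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. On Sunday he "
         "picks double the number he picked on Friday. How many kiwis does Oliver have?")
# Same question with one irrelevant clause added, in the spirit of the study.
DISTRACTOR = PLAIN.replace(
    "How many kiwis",
    "Five of Sunday's kiwis were a bit smaller than average. How many kiwis")
EXPECTED = "190"  # 44 + 58 + 2*44

def tally(prompt: str, trials: int = 20) -> int:
    """Run the same prompt `trials` times and count replies containing the expected number."""
    correct = 0
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed model name; swap in whatever you are testing
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        correct += EXPECTED in text  # naive check: the right number appears somewhere
    return correct

for label, prompt in [("plain", PLAIN), ("with distractor", DISTRACTOR)]:
    print(f"{label}: {tally(prompt)}/20 correct")
```

Tallying a couple of dozen runs per condition gives counts that the significance test sketched above can actually be applied to, instead of a single screenshot.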
> If this surprises you, you've been lied to. Next, figure out why they wanted you to think "AI" was actually thinking in a way qualitatively similar to humans. Was it just for money? Was it to scare you and make you easier to control?

AI has mainly been the always-shady tech industry trying to cash in on a new big thing. It is not like it is actually hard for large software makers to convince media and politicians that something is huge and deserves massive investment. Add in the threat that other countries (China, etc.) may beat the US to the new technology and you have the makings of a true bonanza.
> LLMs have no reasoning ability. They're literally just next-word-guessing algorithms fed on the stolen works of others and, worse, public forum posts. There's no intelligence at work, at all.

It is the more modern version of early (and sometimes modern) Wikipedia pages where people "source" something that has zero factual basis.
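As a toy illustration of what "next word guessing" means mechanically, here is a tiny bigram sampler; the training string and function names are invented for the example. A real LLM replaces the lookup table with a neural network over subword tokens and far more context, but the generation loop (guess one next token, append it, repeat) has the same shape.

```python
import random
from collections import defaultdict

# Tiny "training corpus"; a real model is trained on trillions of tokens.
training_text = (
    "the model predicts the next word the model samples the next word "
    "and appends it to the prompt then predicts the next word again"
).split()

follows: dict[str, list[str]] = defaultdict(list)
for prev, nxt in zip(training_text, training_text[1:]):
    follows[prev].append(nxt)  # remember every continuation seen after each word

def generate(start: str, length: int = 12, seed: int = 0) -> str:
    """Repeatedly guess a next word from the observed continuations and append it."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        out.append(random.choice(options))
    return " ".join(out)

print(generate("the"))
```

The point of the sketch is only the shape of the loop, not the quality of the guesses.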
> Anyone that uses an LLM on occasion should know this from experience. Nice to see some data though.

That's the differentiator. Anyone truly following this industry knew that intuitively (or from experience), but Apple actually assimilated the data. So let's not bash them too hard for this.
> If this surprises you, you've been lied to. Next, figure out why they wanted you to think "AI" was actually thinking in a way qualitatively similar to humans. Was it just for money? Was it to scare you and make you easier to control?

It seems you might be the one with the skewed perspective. Do you use it very often?
It's funny to me that we act as though, if AI isn't "sentient," it must be complete BS. Why not just recognize what it is good at and stop with the unnecessary hype?