Rendering text within generated images is a known limitation of generative image creation. GenAI is terrible at this across the board, and providers use their own dedicated sub-experts just to get text spelled correctly, which can still fail sometimes. There's a long way to go there; I'd never expect these tools to produce an accurate map right now.

That said, it does get the number of states with "R" in the name correct. Because GenAI has so much emotion around it, people have a vested interest in engagement and clickbait nonsense. This is why I advise everyone to run their own tests and use all models available if they are pertinent to your work or adjacent to your interests.

Not everyone needs to use these, of course, but those who want to should be informed. There is a lot of misleading information around, and it leaves people believing the technology is still where it was two years ago, which isn't true.

No one should fully trust what I or anyone else says in here; watching the live demos and trying the models for yourself is the only way to know. Reading Simon Willison's blog for usage tips and staying current with development also helps.

There is a lot of "there" there without being reductive. But it's easy to take pot shots and snipe at each other because this technology is undeniably disruptive and that makes a lot of us understandably uncomfortable.
I gave GPT 5 the same prompt and it told me 22. Apparently there's an R hiding somewhere in New Mexico.

Setting aside the incredible waste of resources and damage to the environment, these things are inherently unreliable. It's not something these companies can fix because it's the nature of the tech they're built on. It's easy to take potshots at these things because they mess up all the time.
 
Did you use the exact prompt I did? Asking it to think is critical. These types of specific knowledge tasks work much better if you either add in web search (e.g., when asking about recent documentation on an API update, "pull from x website where the new docs live, then answer my question") or if you activate CoT / 'reasoning' via 'thinking'.

You can write it off but it gets it correct for me on multiple attempts.

Grok 4 also got it right on the first try for me just now, and activated 'thinking' on its own.

For some questions you don't need to use thinking; for some you do. This stuff requires a bit of know-how regarding how to prompt and what level of effort to tell the model to use. That may make it less useful for people who lack that know-how, but it doesn't mean the capability isn't there.
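If you're calling the models through an API rather than the chat UI, the same 'thinking' toggle exists there. A minimal sketch, assuming the Anthropic Python SDK's extended-thinking option; the model string and parameter details here are my assumptions and may differ by SDK version:

```python
# Minimal sketch: explicitly enabling "thinking" for a factual counting task.
# Assumes the Anthropic Python SDK; model string and parameters may differ by version.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-1",  # assumed alias for Opus 4.1
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},  # extended thinking on
    messages=[{
        "role": "user",
        "content": "Think carefully, then count how many US states have the letter R in their name.",
    }],
)

# The response interleaves "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```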

It does make mistakes of course, especially if you don't meta-prompt well or anchor to something solid, but there is a lot of utility in what's there. These quick tests aren't as useful as actually seeing if it integrates with your workflow in a positive or negative way.

For me, Claude Opus 4 and now 4.1 have been indispensable for my work and for research assistance, but I know what to check and am careful with my prompting. It seems like GPT-5 will be useful too, which is great because I didn't get along well with 4o, but I will miss 4.5 because I threw some really long research tasks at it and got excellent results.

I'm really most excited for the competition to heat up now and push Anthropic to get their pricing down and advance their rollout. I hope they continue to stay afloat for a long while.

edit: for clarity, you need paid accounts to use the features I'm talking about above. Free versions will not offer these tools; OpenAI was very specific about the level of model being offered to the free tier. It's a big step up in quality to pay the $20, subject to usage quotas of course.
 
I copied the exact prompt from the screenshot. But either way, does the fact that you have to instruct it to "think carefully" for such a simple task not raise a big red flag for you? Should it not be thinking carefully for every task, if that's required to make it do the job properly?

I ran the exact same prompt again in a new chat and it gave me the right answer this time, which again goes to the point of this stuff being unreliable.

It's just insanely underwhelming knowing how much money has been flushed away on this.
 

As one example, right now I'm learning some facets of a programming language I'm not intimately familiar with, and I'm using Opus as an aid to a lecture series that makes some mistakes in pedagogical ordering. I'm using it kind of like a TA: not to write code or autocomplete my work, but to explain some poorly documented APIs and packages, and to answer usage questions when something I do works when I didn't expect it to, or vice versa.

It's been easily as valuable as a real tutor or moderate-tier pair programmer would be to me, and it's far cheaper, always available, etc. I don't expect it to write programs in their entirety for me, nor would I want it to. I also don't expect to offload my own mental tasks or truly deep-thought knowledge work to these tools; I need to stay sharp to earn income and continue to excel in my profession.

But both things can be true of these tools now: they are flawed, but they can still be quite useful. Claude Code is used by a lot of people with great results. I don't have a workflow I can integrate it with at the moment, but down the road, when company policies change and the technology advances, I can see that being different. Understanding usage patterns and best practices is, to me, beneficial, because I can speak with some authority on the capabilities and on what I'm getting out of the models.

I don't think the utility is in asking it to count the number of states with "r" in the name. This sounds like a pithy reply, but I don't mean it that way. A harsher add-on would be to say something like "if you can't find utility in these tools, the problem is with you, not the tools." I don't agree with that statement in its entirety, but there is something to it, at least in my experience.

I fully expect LLMs to be mostly obviated within ~5 years; world models should replace them entirely, though the real timeline could be anywhere from 2028-2035 all the way to never if the technology fails to materialize. But for now, as of this year, they are pretty useful in broad areas, at least to me. Your mileage may vary.
 
To be clear, I'm not saying I never use LLM tools. I've made real efforts to try to use them for all sorts of tasks, with very mixed results. I'm changing careers, and as part of my initial retraining I used various LLMs to quiz me on material I had studied. Quite frequently, I would find that the LLM would pose decent questions for me, but then proceed to get the answer to its own question wrong.

I've also used them for things like spreadsheet formulas, but again they often mess these up, formatting things for the wrong programs even after I specify what I want them for.

And finally, importing timetables into calendars, something that should be simple, right? It accurately takes the text from a photo or image, formats it for an iCal file, and then messes up at the last step of creating the text file, with every attempt to correct it making the problem worse.
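For what it's worth, the target format itself is trivial; here's a rough sketch of the kind of .ics text that last step should produce (the events and filename are invented placeholders, and a strictly valid file would also want a UID and DTSTAMP per event):

```python
# Rough sketch of a minimal iCalendar (.ics) file of the sort a timetable
# import needs. Event times and titles below are invented placeholders.
events = [
    ("20250901T090000", "20250901T100000", "Intro lecture"),
    ("20250902T140000", "20250902T153000", "Lab session"),
]

lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//timetable//EN"]
for start, end, title in events:
    lines += [
        "BEGIN:VEVENT",
        f"DTSTART:{start}",
        f"DTEND:{end}",
        f"SUMMARY:{title}",
        "END:VEVENT",
    ]
lines.append("END:VCALENDAR")

# RFC 5545 expects CRLF line endings; newline="\r\n" handles that on write.
with open("timetable.ics", "w", newline="\r\n") as f:
    f.write("\n".join(lines) + "\n")
```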

And yeah, asking stuff like how many Rs in state names etc isn't exactly a key use, but it's such a basic task that if it can't do that I have to wonder what else it can't do. How can I trust it to do complex tasks if it can't reliably do extremely simple ones?

I'm sure if I paid for the pro versions of these tools and really, really tried, I could get them doing more stuff and save 5 mins here or there, but frankly, until I can trust them not to mess things up all the time, why would I bother? Just like with Siri, I find that in the time it takes me to ask it to do the thing (and then re-ask and re-clarify), then verify it's done it right, I could usually have just done the thing myself, knowing I'd done it right and done it the way I want it done.

It seems that coding and coding related tasks may be the only area where these things are actually somewhat good.
 
“OpenAI says that GPT-5 is its best coding model to date”

Obvious thing for OpenAI to say. Why even say it? Why would GPT-5 be worse than GPT-4? So dumb.
 
Grok is an industry joke; nobody takes it seriously because it's benchmark-maxxed junk. Claude models are not always the top benchmark performers, but everyone swears by them for coding and real-life use cases.

"everyone" I literally had grok 4 fix problems Claude Sonnet 4 couldn't.
 
Yes.


lol Electrek.
Frederic Lambert sold his stock (which he bet the house on at one point, literally) and is mad it's 1.5x the value now. He's desperately trying to bring the stock back down so he can get back in. There are plenty of provably wrong articles he's written in the past.
 
That’s just plain wrong. Please don’t say something that’s obviously not true without any facts backing it up. You don’t like Grok for whatever personal reason, fine, but it is far from an industry joke. An industry joke doesn’t get valued at 200 Billion. Come on now.

"An industry joke doesn’t get valued at 200 Billion"

You just exposed yourself. We are in the biggest tech bubble in history and you're quoting a private valuation for an Elon Musk company lol. If it didn't have a massively inflated valuation there would be something massively wrong and we should all be panicking.

Anyway, go to any SF AI event or conference and ask everyone this question: "What models are you guys using?" and you will almost NEVER hear "Grok"
 
"everyone" I literally had grok 4 fix problems Claude Sonnet 4 couldn't.

Sorry, but that means nothing. I see the occasional Grok success anecdote on Twitter, which is meaningless in an ocean of people swearing by the other models such as Claude. You just said it yourself: you were using Sonnet 4 and ran into a bug it couldn't crack... so you were using Sonnet 4 like almost everyone else. Most engineers I've spoken to in your situation switch to Opus 4, o3, or Gemini 2.5 Pro. Again, almost nobody is using Grok for coding, and I have yet to find a single example of people using it for any professional production use case.

It's nothing personal; I've actually paid for SuperGrok (or whatever it's called) in the past and thought it was mostly junk 🤷 I gave it an earnest go when Grok 4 came out and... it's just meh. It's really embarrassing how poor the benchmark-to-real-world-capability ratio is; that's why everyone thinks it was tuned for benchmarks and not practical use.

Consider that Anthropic's models are never benchmark-topping, yet they are very obviously better at agentic workflows (especially coding) than basically anything else. Everyone knows Claude is king for coding, but perhaps GPT-5 will change that; Grok 4 certainly didn't.
 
What about quantum-consciousness people like Roger Penrose saying something like intelligence equals being able to be conscious? Then I guess the question is what defines consciousness. According to him, consciousness arises from non-computational processes… meaning it can’t come from a computer.

Not sure if I agree with that, but it would be interesting to hear your definition of consciousness…
Intelligence and consciousness are different concepts; the hint is in the fact that they are two different words with different definitions. That being said, if consciousness arises from non-computational processes, then we’re not conscious either, as our brain is just a biological computer using neurons instead of copper and transistors.
 
The other day I asked ChatGPT 4o if Pink Floyd ever revealed what "No one flies around the sun" means, in the lyrics of the first half of their track "Echoes", though I already knew the answer (they haven't). Rather than just saying yes or no, ChatGPT answered by initially quoting the lines from the second half of the song, then started to pour out some of the lines from the first half of the song, but then interrupted itself before showing the line in question, and said, "So, this shows that the line 'No one flies around the sun' doesn't appear in this song." I told it that it was wrong, but it insisted it was right. I went around with it a few times until it finally admitted I was right, and then accurately said "No they haven't". Then it said it couldn't really answer questions about song lyrics since it claimed it wasn't designed to quote full lyrics from songs, due to copyright issues (I'd never seen it admit that copyright is a concern for it). But that doesn't really explain its "answers" to my question about just a single line from the song.

I've had a fair number of such "experiences" with ChatGPT about a variety of other questions. Despite its frequent successes, it's still not ready for full-time prime time. The same is almost certainly true for all the others.
 
Do people still use ChatGPT lol? Claude is king!
Can you use Claude without registering and logging in? Because that's a massive advantage of ChatGPT.

I wanted to use DeepSeek, but since it requires an account and I have to log in every time (I have my browser set to delete cookies on close), I don't bother.
 
How do you define intelligence, and why do you think these models do not qualify?
The ability to gain new insight and wisdom from a new encounter or experience. The ability to recognise, analyse, and instinctually approach a certain task. I suppose, though, that it can be argued instinct is itself a model.

If you get ChatGPT stuck, it will parrot the same reply over and over. It can't recognise that it needs to change its approach in order to solve the problem, based not just on what it's asked but on who is asking it.
 
And how can you be so sure that humans don’t do exactly the same thing these AIs do (just better, for now)? How can you tell the difference between working out the correct answer and intelligently answering a question? As long as the answer is correct and the reasoning sound, what’s the difference?
I never said humans couldn’t. Of course they can. Intelligence, in the LLM sense, is simply answering the question correctly. Artificial, in the LLM sense, is the non-human factor. It’s not ‘real’ artificial intelligence, as in self-aware and able to reason independently like (most) humans, but it is still an intelligence which one can access and make legitimate use of.
 
And how can you be so sure that humans don’t do exactly the same thing these AIs do (just better, for now)? How can you tell the difference between working out the correct answer and intelligently answering a question? As long as the answer is correct and the reasoning sound, what’s the difference?
So, the brain weighs about 3lbs and runs mostly on alcohol, coffee and porridge (just me?). As opposed to mind-bogglingly expensive, resource intensive farms of GPUs/CPUs that struggle to get close to the truly imaginative, intelligent, free thinking, independent, creative beings that we are. I can take my brain with me wherever I go, without needing a data connection or small power station to keep the lights on. I take a holistic view. We've a lonnnnnngggg way to go yet. For now, I'd settle for "answer is correct and the reasoning sound".
 
Utterly unhinged that someone can write the sentence "GPT-5 is less likely to lie to the user" without any kind of comment or concern about that. On the contrary, this whole piece reads like a corporate press release.
 
Lol, all this hype and long months of development, only to fall short of Grok 4, which was released earlier and took far less time to train.
 
Can we somehow make ChatGPT our default Siri?

The problem is that, at least in my experience, these LLM systems still fail badly at some of the basics I want from a smart assistant. I think this might be one reason why Apple and Amazon are having such difficulty bringing LLM technology to their Siri and Alexa smart assistants.

There's a standard question I ask because it's an example of the fairly common travel-related queries I put to my smart speakers. I just asked ChatGPT "What time is the next train from <my local train station> to <a major London train station>?". That major London station is on a direct line from my local station, so in all cases it's only a single train journey.

In fairness, ChatGPT did better than it has done in the past. Previously it would simply invoke a route planner to get from my current location to the London terminus and then tell me when I needed to leave my home to walk to the bus stop, catch a bus to my local station (which I never do; it's an 8-minute walk), and then catch the train. It would then go on to tell me that if I wanted the exact times of the trains, I should go to the website of my local train company and look up the timetable there.

Anyway, today ChatGPT actually recognised the precise intent of my question and gave a direct answer, that the next train was at 10:04. It even gave me the departure platform and gave times and platforms for another 3 trains after that one. Useful if I missed the first train on the list.

The problem is that it also answered my follow-up question, "What time is it now?", correctly. It gave me the right answer: 10:41. So the "next train" it gave me had already departed 37 minutes earlier, and in fact the following two trains had also already departed by the time it gave me its answer.
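The missing step isn't complicated; a toy sketch of the check a proper assistant integration would need, dropping departures that are already in the past (the timetable here is made up):

```python
# Toy sketch of the step the assistant skipped: filter out departures that have
# already left before reporting the "next train". The timetable is made up.
from datetime import datetime, time

departures = [time(10, 4), time(10, 19), time(10, 34), time(11, 4)]
now = datetime.now().time()

upcoming = [t for t in departures if t > now]
if upcoming:
    print(f"Next train: {upcoming[0].strftime('%H:%M')}")
else:
    print("No more trains today.")
```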

To get a decent, reliable user experience from a general-purpose smart assistant, there really does still seem to be a lot more work to do, because I suspect this sort of total failure to give a correct and useful answer to a travel question is by no means the only edge case where current LLMs fail badly.
 