Finally, LLMs and all other forms of "generative AI" are a hype bubble that is going to pop, not the foundational technology of the next 25 years. If you're curious about them, cool, but don't fall for the FOMO being spread by those inflating the bubble.
Don’t fall for the bubble…. What the heck? I’m using them every single day for work. We are building agents to do everything from sorting support tickets to helping sales people quickly create proposals. Our engineering team is using them to build new software projects fast and to quality test old features that we couldn’t before.

What world are people living in?
 
What is an "inference machine"?

That’s a good question. I’d like to know the answer as well.

Inference does not need gradient computation or backpropagation, so maybe it's better thought of as a streaming workload? No idea. Maybe someone with a stronger ML background can comment. What are the principal differences in requirements for training and inference?
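For what it's worth, here is a minimal sketch of the practical difference, written in PyTorch with a toy model and made-up data (purely illustrative, not any particular LLM): training needs the forward pass plus gradients, optimizer state, and a backward pass, while inference is just the forward pass with gradient tracking turned off. For LLM inference specifically, the work tends to be dominated by streaming the weights through memory, which is part of why unified-memory Macs do comparatively well at it.

```python
# Minimal sketch (toy model, random data) of training vs. inference in PyTorch.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)          # stand-in for a real network
x = torch.randn(8, 16)            # a batch of made-up inputs
target = torch.randn(8, 4)

# Training step: forward pass, loss, backpropagation, weight update.
# Needs memory for activations, gradients, and optimizer state.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                   # gradient computation / backpropagation
optimizer.step()
optimizer.zero_grad()

# Inference: forward pass only, weights frozen, no gradients tracked.
model.eval()
with torch.no_grad():
    y = model(x)
```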
 
Don’t fall for the bubble…. What the heck? I’m using them every single day for work. We are building agents to do everything from sorting support tickets to helping sales people quickly create proposals. Our engineering team is using them to build new software projects fast and to quality test old features that we couldn’t before.

What world are people living in?

While I am not sure I completely share @mr_roboto’s skepticism, there are a lot of problems with LLMs. They are inefficient, prone to hallucination, unable to distinguish fact from noise, and difficult to optimize. Not to mention that we seem to have run into a plateau (GPT-5 is barely better than 4 and sometimes breaks down completely). It’s currently popular to use LLMs as a proxy for general AGI and for simulating thought processes, but that’s an extremely awkward crutch. Future AI systems might look rather different.
 
I am in the camp that finds LLMs very useful, especially for quickly finding stuff, since Google search sucks more and more. Also, used in the right way, and with an understanding of how they work, they make routine tasks a lot faster. Just exercise some restraint when using them.
By design, LLMs hold a "compressed database" of a lot of information, basically the whole web plus some other unknown stuff. For commonly recurring patterns they hallucinate very little, so as long as one works with that kind of material, the results are quite awesome. I seldom write Python scripts from scratch anymore, and for web work it seems very good (although I am not a web coder). For more niche knowledge there is more randomness, or the model locks onto the patterns it has seen.
I mainly code in C99, and there it has been quite sucky. But still, for simple things it will produce very useful output, even if it will not be up to date with the latest developments for APIs that change, etc.

My impression is that for coding, GPT5 is a lot more solid, at least if you iterate and collaborate with it.
The biggest downside is that if you let it write too much, you will not be able to change the code manually, since you no longer understand what it is doing.

My main gripe is that the really good models are online and owned by large corps. So I am also looking into having my own backend servers. A Mac is quite capable these days, especially for smaller models targeting well-defined tasks.
But tbh, I am very unsure what to invest in. At work we have 4090s and an H100 for on-prem ML training, but we do not use any LLMs atm; it has just been for my own experiments.
It is so odd (but I know the reasons, of course) that a Mac can run LLMs quite reasonably while positively sucking at things like "classic CNN" visual models. I did some benchmarks on an M3 Max, a 4090, and an H100, and we are talking an order-of-magnitude difference.

Really hope this changes.
So for a personal professional backend I am not at all sure what to get. For visual work, I would get a dual-5090 box prebuilt by some integrator. For running LLMs on a budget? Well, that's hard; the M4 Max seems to be at a sweet spot right now, otherwise a Blackwell Pro with 96GB of RAM...
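As a rough illustration of the "own backend" idea, here is a minimal sketch of talking to a small model served locally. It assumes an Ollama server running on the Mac with a model already pulled; the model name and endpoint are illustrative and may differ in your setup.

```python
# Hedged sketch: query a locally served model over Ollama's HTTP API.
# Assumes `ollama serve` is running and the named model has been pulled.
import json
import urllib.request

def ask_local_model(prompt: str, model: str = "llama3.1:8b") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example of the "well defined task" use case: local ticket triage.
print(ask_local_model("Classify this support ticket as billing, bug, or feature request: ..."))
```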
 
That’s a good question. I’d like to know the answer as well.

Inference does not need gradient computation or backpropagation, so maybe it's better thought of as a streaming workload? No idea. Maybe someone with a stronger ML background can comment. What are the principal differences in requirements for training and inference?

It's the Ticket-Out-The-Door that helps let me know what my students learned this day.

What I've learned is that I have about a thousand choices at my disposal; and, that there are a non-quotient, non-qualitative few which might speculatively allow me to open a door that is unspecified, and which may not open to a room that I might desire.

Am I willing to bet USD 10K on my basement?
 
It's the Ticket-Out-The-Door that helps let me know what my students learned this day.

What I've learned is that I have about a thousand choices at my disposal; and, that there are a non-quotient, non-qualitative few which might speculatively allow me to open a door that is unspecified, and which may not open to a room that I might desire.

Am I willing to bet USD 10K on my basement?

I am not sure I follow.
 
While I am not sure I completely share @mr_roboto’s skepticism, there are a lot of problems with LLMs. They are inefficient, prone to hallucination, unable to distinguish fact from noise, and difficult to optimize. Not to mention that we seem to have run into a plateau (GPT-5 is barely better than 4 and sometimes breaks down completely). It’s currently popular to use LLMs as a proxy for general AGI and for simulating thought processes, but that’s an extremely awkward crutch. Future AI systems might look rather different.
You listed a lot of problems but none of the benefits. The benefits are world changing despite the problems.

You'll have to define inefficiency because they appear to be extremely profitable in terms of inference cost: https://martinalderson.com/posts/are-openai-and-anthropic-really-losing-money-on-inference/

Prone to hallucination - so are humans. People need to learn to use LLMs the right way. If the task requires high accuracy, you should have a human check it. Foundational models continue to hallucinate less and less; couple that with verification tool use and things can get a lot better quickly.

Difficult to optimize - you'll have to be more specific here.

(GPT-5 is barely better than 4 and sometimes breaks down completely) - You should measure the capabilities of GPT4 vs GPT5. If you look at this chart, it's as big of a leap as GPT3.5 to GPT4. The only reason we don't really notice is because OpenAI has been releasing incremental updates to 4o instead of one big one every few years. If you use LLMs to have basic conversations and to answer simple questions, you're not going to notice the difference between GPT4 and GPT5. If you are a professional who relies on LLMs for work, you should notice a big difference. The general sentiment around GPT5 is from the masses, which is the wrong way to measure AI progress. It was already more than good enough for general questions at GPT4 level.

[Attached chart comparing benchmark scores across GPT model generations]
 
Prone to hallucination - so are humans. People need to learn to use LLMs the right way. If the task requires high accuracy, you should have a human check it. Foundational models continue to hallucinate less and less; couple that with verification tool use and things can get a lot better quickly.

This.

People compare LLMs to junior humans like they expect either to be perfect.

They aren't.

Think of LLM-based AI as a junior you can assign a bunch of tedious work to grind out.

Will you need to check it before using it? Yes. Is it quicker to check/fix than write or find it all from scratch? 100% yes.

LLMs are just another tool in the toolbox, like a calculator or a spreadsheet. You can pretend they don't exist, but it won't change the fact that those who know how to use the new tools will leave you behind.

They're great at search if you use them for that. They can filter through most of the trash and then give you a fully cross-referenced, cited bunch of content in a fraction of the time it would take to read through one or two of the first links.

Traditional search is mostly dead to me. Google sees the writing on the wall as well and is transitioning to AI queries by default, though they seem to be doing a poor job of that (or maybe I just got some bad results and didn't give it enough of a chance).
 
Even in coding, generative models can hallucinate and/or provide the wrong information entirely unless you first provide a clear outline of what you need to the model you are interacting with. Depending on the specific model being used, the type and amount of information needed to generate a desirable result can vary. I would warn against treating "fully cross-referenced, cited" responses as verified fact, because there are numerous instances of these models hallucinating and completely fabricating sources, so you would need to personally verify those sources are a) real and b) saying what the ML model claims.
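As a first-pass sanity check, something like the hypothetical helper below can triage model-provided citations. It only confirms that the cited links resolve at all; whether a source actually says what the model claims still needs a human read. The URLs are placeholders.

```python
# Hedged sketch: check that model-cited URLs at least resolve.
# This does NOT confirm the source supports the model's claim.
import urllib.error
import urllib.request

def url_resolves(url: str, timeout: float = 10.0) -> bool:
    try:
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "citation-check/0.1"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False

# Placeholder URLs for illustration only.
for url in ["https://example.com/cited-paper", "https://example.org/cited-article"]:
    print(url, "->", "reachable" if url_resolves(url) else "check manually")
```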
 
Even in coding, generative models can hallucinate and/or provide the wrong information entirely unless you first provide a clear outline of what you need to the model you are interacting with. Depending on the specific model being used, the type and amount of information needed to generate a desirable result can vary. I would warn against treating "fully cross-referenced, cited" responses as verified fact, because there are numerous instances of these models hallucinating and completely fabricating sources, so you would need to personally verify those sources are a) real and b) saying what the ML model claims.
They're a lot better now - to the point that I've had coding agents one-shot fully working apps when I gave them specs.

It's much different than in the past, when people were just copying and pasting code from ChatGPT or RAG-based LLMs. Now models are fully integrated into coding editors as agents that can search, test, and compile.
 
They're a lot better now - to the point that I've had coding agents one-shot fully working apps when I gave them specs.
I don't have any experience with local LLMs on my Mac, but I've been using ChatGPT. I compared ChatGPT, Grok, and Copilot at producing a Python script. Grok failed, ChatGPT worked but needed tweaks, and Copilot was the most fleshed out, with it being easier to extend the script.
 
I don't have any experience with local LLMs on my Mac, but I've been using ChatGPT. I compared ChatGPT, Grok, and Copilot at producing a Python script. Grok failed, ChatGPT worked but needed tweaks, and Copilot was the most fleshed out, with it being easier to extend the script.

I've had a different experience with Copilot.

All I wanted to do was extract a table from a PDF into CSV (this was a few months ago). I just wanted to get the quote out of the PDF and into Excel, so that cut/paste into our ordering system (yeah, stone-age caveman stuff) would pick up the exact cell contents, and not spaces and other crap that may have been in the PDF table.
  • Copilot wanted me to install Python and run a script (roughly like the sketch after this list)... like... sure, I am an admin, I could do that, but no... you're the AI, figure it out.
  • Llama 70B got confused but almost did it; it tended to merge some cells it should not have, but was super close.
  • A locally run Granite 8B (IBM model) just did it right off the bat.
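For what it's worth, the kind of script Copilot was pushing could be as small as the sketch below. It assumes the third-party pdfplumber package; the file names are placeholders, and real PDF layouts often need extra table-extraction settings.

```python
# Hedged sketch: pull the first table from a PDF quote and write it to CSV.
# Requires `pip install pdfplumber`; "quote.pdf" / "quote.csv" are placeholders.
import csv
import pdfplumber

with pdfplumber.open("quote.pdf") as pdf:
    table = pdf.pages[0].extract_table()   # first table on the first page

if table:
    with open("quote.csv", "w", newline="") as f:
        csv.writer(f).writerows(table)
else:
    print("No table detected; the layout may need custom extraction settings.")
```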
I have paid ChatGPT (personal) and Copilot (work) accounts. I haven't played with any of the other cloud models yet, but ChatGPT has been a massive help accessing/deciphering/finding Microsoft docs and is pretty decent at PowerShell one-shot scripts for admin stuff.

I have found ChatGPT search/deep research quite useful. It helped me find an in-stock, brand-new RAV4 in the colour and spec I wanted within 50 km of me, whereas my local dealer said there would be a 3-month wait :D
 
Don’t fall for the bubble…. What the heck? I’m using them every single day for work. We are building agents to do everything from sorting support tickets to helping sales people quickly create proposals. Our engineering team is using them to build new software projects fast and to quality test old features that we couldn’t before.

What world are people living in?
Not in a business environment, where LLMs, used with caution, work well. In the consumer world, LLMs are designed to groom users in an addictive manner. There are quite a few papers and articles warning about this. There are already AI religious cults on TikTok and YT.
 
Not in a business environment, where LLMs, used with caution, work well. In the consumer world, LLMs are designed to groom users in an addictive manner. There are quite a few papers and articles warning about this. There are already AI religious cults on TikTok and YT.
Yea, people live in a weird world. What do you guys do for work?
 
Not in a business environment, where LLMs, used with caution, work well. In the consumer world, LLMs are designed to groom users in an addictive manner. There are quite a few papers and articles warning about this. There are already AI religious cults on TikTok and YT.

I'm not even sure whether they're designed to do that, or whether we just live in a world with a lot of sad, lonely people.
 
You should measure the capabilities of GPT4 vs GPT5. If you look at this chart, it's as big of a leap as GPT3.5 to GPT4. The only reason we don't really notice is because OpenAI has been releasing incremental updates to 4o instead of one big one every few years. If you use LLMs to have basic conversations and to answer simple questions, you're not going to notice the difference between GPT4 and GPT5. If you are a professional who relies on LLMs for work, you should notice a big difference. The general sentiment around GPT5 is from the masses, which is the wrong way to measure AI progress. It was already more than good enough for general questions at GPT4 level.


It is clear that GPT5 shows stronger performance on benchmark suites. What I am talking about, however, is the experience of actually using the model.

I found GPT5 to be inconsistent to the point of irritation. For example, where 4.1 was able to iteratively refine text passages through incremental prompting, GPT5 often breaks style in an unpredictable way, and the only way to get it back on track is to restart the chat. For the kind of coding I do it’s still barely usable. Yes, it can do basic stuff, but it still generates awful and often incorrect code for anything that requires even a little bit of algorithmic thinking.

I don’t know enough about LLMs to articulate my thoughts better. I just wonder whether token prediction is really the best way to model intellectual discovery processes.

This.

People compare LLMs to junior humans like they expect either to be perfect.

They aren't.

I don’t know what your definition of “junior human” is. I evaluate the models based on how they can help me do my work. So far they work best for writing documentation, metadata, boilerplate, and unit tests (they need to be carefully prompted). I also had some success using them for identifying stylistic and communicative gaps in written text, and for exploring idea spaces. They are also great for learning, and here I admit they saved me a lot of time exploring unfamiliar concepts. I found LLMs - including the latest models - to be of very limited use for coding, data analysis, or more complex stats, because they will make mistakes that are hard to spot.

Basically, the big difference for me is that a human will admit if they are uncertain or simply don’t know how to do something. An LLM will always give you an answer that sounds super reasonable but is actually hot garbage. Even GPT5 makes really bad factual errors with super simple text summaries.
 
I don’t know what your definition of “junior human” is.

Someone who you can't trust to submit important work without checking it first.

Think: equivalent of a junior programmer publishing code without review, a junior editor publishing to the internet without review, etc.

I'm not saying AI can never generate useful content, I'm just saying it's on YOU as the end user to validate it before blindly trusting it. Because it will reflect on YOU if it's garbage. By all means use AI tools, just validate before relying on the output. The AI won't care if it's right or wrong, and there are no consequences for IT. You, however, have a reputation/job to keep/etc. - so take proactive steps to ensure that.

Basically, the big difference for me is that a human will admit if they are uncertain or simply don’t know how to do something. An LLM will always give you an answer that sounds super reasonable but is actually hot garbage. Even GPT5 makes really bad factual errors with super simple text summaries.

I've had junior staff exactly like that (LLM behaviour). Ones who are somewhat on the spectrum and have an inflated sense of their own abilities.
 
I found GPT5 to be inconsistent to the point of irritation. For example, where 4.1 was able to iteratively refine text passages through incremental prompting, GPT5 often breaks style in an unpredictable way, and the only way to get it back on track is to restart the chat. For the kind of coding I do it’s still barely usable. Yes, it can do basic stuff, but it still generates awful and often incorrect code for anything that requires even a little bit of algorithmic thinking.
Are you using ChatGPT Plus or free? GPT5, free or Plus, has a router that chooses between standard, low, and thinking modes. Perhaps it's routing your stuff to the low mode.

I mainly use coding agents and GPT5 is significantly superior to GPT 4.1. I only used Claude 4 in the past but have now been using GPT5 more.
 
Someone who you can't trust to submit important work without checking it first.

Think: equivalent of a junior programmer publishing code without review, a junior editor publishing to the internet without review, etc.

I'm not saying AI can never generate useful content, I'm just saying it's on YOU as the end user to validate it before blindly trusting it. Because it will reflect on YOU if it's garbage. By all means use AI tools, just validate before relying on the output. The AI won't care if it's right or wrong, and there are no consequences for IT. You, however, have a reputation/job to keep/etc. - so take proactive steps to ensure that.



I've had junior staff exactly like that (LLM behaviour). Ones who are somewhat on the spectrum and have an inflated sense of their own abilities.
I agree. For some strange reason, people expect LLMs to never make mistakes. Instead, they should always check the LLM's work if it's important. If a task is not that important, then don't check. Learn how to use LLMs instead of dismissing them as not worth using because they aren't "AGI".
 
A bit of a non sequitur - if you're someone who uses a cloud LLM and leverages it every day, IMO some future under-2 lb, 11" MacBook would be a perfect device to use it on.

I sometimes find myself wanting to use a device to leverage that LLM, but I'm not at a desk or don't want to get out a laptop, and I need more than a phone or tablet. That 'smallest viable laptop' would be ideal. For example, when thinking about some problem, I'll have an idea or question and want claude-code to tackle it - it'd be nice to flip open some kind of small, fully powered companion device for a couple of minutes.

Kind of how I used an 11" Air back in the day, only now with more purpose...
 
Is there a solution to the basic problem in AI/LLMs etc.: repeat something enough times and it becomes the truth?
 