Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.
Yeah, but that’s now, with cloud LLMs heavily subsidized. Wait for them to charge you the real cost, and we’ll see if it’s sustainable.
In the meantime what’s the alternative except either investing massively in new hardware, or use the cloud now which is much faster than my 4090 and change later if needed?
 
In the meantime what’s the alternative except either investing massively in new hardware, or use the cloud now which is much faster than my 4090 and change later if needed?
In the meantime, indeed, it’s a good idea to take advantage of those heavily subsidized services… but keeping in mind that those aren’t the real costs, so that we can build an alternative as soon as they start to rise in price.

I mean, I’m using Claude right now, but I keep exploring local LLMs to make the switch whenever I need it.

Hopefully Apple will invest more in local AI for future Apple Intelligence tools.
 
  • Like
Reactions: richinaus
For anyone reading this lately and wanting to run the newer models, Gemma 4 31B is pretty good but will go to 80-90gb easily, and that’s before long context. Get as much memory as possible if you want to use these models locally. Hopefully the next generation of laptops will finally go above 128gb, but we have a year wait probably.]

M5 Ultra can’t come soon enough.

What are you using it for? I've been debating between 64GB and 128GB when the M5 Mac Studio comes out and have been leaning towards 64GB because I just don't know if I can stomach an additional $800 (the price of a phone, tablet, or laptop) for just more RAM.

I've been able to get more use out of 16GB of RAM on my M1 Pro MacBook Pro for local LLMs than I was expecting.
 
  • Wow
Reactions: Populus
What are you using it for? I've been debating between 64GB and 128GB when the M5 Mac Studio comes out and have been leaning towards 64GB because I just don't know if I can stomach an additional $800 (the price of a phone, tablet, or laptop) for just more RAM.

I've been able to get more use out of 16GB of RAM on my M1 Pro MacBook Pro for local LLMs than I was expecting.
Research work; if you're happy with what you're getting out of 16gb by all means save the money, but Gemma-4 31B empirically does use 80-90gb even without long context, so I think something like 192gb would be necessary if you used it at long context plus other workloads at the same time.

I'm doing some very specific math heavy things that I can't get into so I am constantly hitting RAM ceilings – earlier tonight I hit OOM around 142gb and that was not even with a large model.

There is a danger in overbuying if you don't really need it, quantized models can be decent but not for the work I'm doing I have to run a lot of stuff at FP64 which is a whole 'nother can of worms. I'm kind of within a niche of a niche right now, but hopefully not forever.
 
  • Like
Reactions: xxray
How heavily subsidized does everyone think the cloud LLMs actually are? I'd heard something like OpenAI costing $1.25 for every dollar they actually charge; reaching historically normal business margins would presumably call for a price increase of about a third and reaching the kind of margins tech companies seem to like to go for these days might call for something closer to double the price at that rate.
 
Research work; if you're happy with what you're getting out of 16gb by all means save the money, but Gemma-4 31B empirically does use 80-90gb even without long context, so I think something like 192gb would be necessary if you used it at long context plus other workloads at the same time.

I'm doing some very specific math heavy things that I can't get into so I am constantly hitting RAM ceilings – earlier tonight I hit OOM around 142gb and that was not even with a large model.

There is a danger in overbuying if you don't really need it, quantized models can be decent but not for the work I'm doing I have to run a lot of stuff at FP64 which is a whole 'nother can of worms. I'm kind of within a niche of a niche right now, but hopefully not forever.
Nonetheless, seeing as there's literally nothing out there available above 128GB, is there a good use case for 128GB, or is it sort of a "tweener" configuration that leaves you better off just sucking it up on a smaller memory if you have to buy now and offload everything else to the cloud?
 
Nonetheless, seeing as there's literally nothing out there available above 128GB, is there a good use case for 128GB, or is it sort of a "tweener" configuration that leaves you better off just sucking it up on a smaller memory if you have to buy now and offload everything else to the cloud?
I’m already hitting the limits so I can’t really do more, if I upgraded it would be for the ~30% speed up and 3x-5x inference time to first token speedup vs. M4.

There ae some experiments I want to do with quantum ‘agents’ which need enormous amounts of memory to approximate with python, I think 3 of them will barely fit within 512GB so I’m really, really hoping that config comes back. My work is not really a normal “I’m just writing some code” use case, I do sometimes use AI to orchestrate local runs – I do not offload my source code to the cloud, it all stays on disk and runs locally. My issue is that a lot of the primary work I’m doing eats ~80-150gb if I let it, which means that to run local AI alongside that is …difficult.

I think 128GB is a fine config for most power users, given how much memory costs I think Apple’s prices right now are a bargain honestly. I expect the studio will raise them, sadly.

For “normal” use a quantized Gemma-4 or a lower parameter count will be fine, it’s a really good model. I did some HLE testing and it solves some of them out of the box, albeit at ~15-45 minutes each on the M4 Max. Also the faster disk on M5 is nice, there is some newer work that streams larger models in via a sort of sliding window mechanism, you can google for more on that. People are getting good results with the newest Qwen also but I haven’t tried that yet.

TL;DR I think if it’s a computer you’re going to keep for a long time I think buying the most you can is worthwhile, if you are a power user. If you’re just curious and don’t have a use case, probably stick with less.
 
Last edited:
  • Like
Reactions: dblissmn
I’m already hitting the limits so I can’t really do more, if I upgraded it would be for the ~30% speed up and 3x-5x inference time to first token speedup vs. M4.

There is some work I want to do with quantum ‘agents’ which need enormous amounts of memory to approximate with python, I think 3 of them will barely fit within 512GB which is the minimum I need for one of my projects so I’m really, really hoping that config comes back. I’m doing work with things like diffusion and complex graphs and lots of things like that so it’s not really a normal use case, I also sometimes use AI agents to orchestrate local runs – I do not offload my source code to the cloud at all, it all stays on disk.

I think 128GB is a fine config for most people, given how much memory costs I think Apple’s prices right now are a bargain honestly. I expect the studio will raise them, sadly.

For “normal” use a quantized Gemma-4 or a lower parameter count will be fine, it’s a really good model. I did some HLE testing and it solves some of them out of the box, albeit at ~15-45 minutes each on the M4 Max. Also the faster disk on M5 is nice, there is some newer work that streams larger models in via a sort of sliding window mechanism, you can google for more on that.

TL;DR I think if it’s a computer you’re going to keep for a long time I think buying the most you can is worthwhile, if you are a power user. If you’re just curious and don’t have a use case, probably stick with less.
This is helpful, thanks. I have a MBP on order with 128, the key thing being I really want to do local AI on large PDF archives and book manuscripts and generally raise my game on research. This in addition to multitasking tendencies and the usual collection of photo/video etc.
 
  • Like
Reactions: novagamer
This is helpful, thanks. I have a MBP on order with 128, the key thing being I really want to do local AI on large PDF archives and book manuscripts and generally raise my game on research. This in addition to multitasking tendencies and the usual collection of photo/video etc.
Don’t sleep on GPT Pro if you can afford it and are OK with the possibility of research being leaked or reviewed by humans. If you aren’t doing critical IP / patent-level work it’s very worth the $200/mo, if you are the cost is eye watering ($160 per million output tokens for 5.5 Pro). Obviously if you use any paid model turn off training on your data, which you can for Claude and GPT but not Gemini unless you use a private chat which does not persist state.

For the consumer version use the web and “extended research” and occasionally combine it with ‘/deep research’ to get the breadth. I don’t think any local model will come close to being able to synthesize like GPT Pro can for probably a year, but I hope I’m wrong. I like Claude much better for most things but for breadth and depth pulling in 600 documents GPT Pro is extremely good.

I used AI to translate some estoeric research work from a foreign language for me and used that to synthesize some other stuff, it’s pretty great the tools we have access to despite the obvious societal downsides.

Local models will do just fine with converting PDFs and managing things, they are pretty capable now, and some can even wire into harnesses similar to Claude Code, but I haven’t seen good demonstrations for the web search abilities like I have with the frontier labs; they simply have more reach and speed right now – although I haven’t done a ton of experimenting with local models for this purpose.
 
Fun thread. 🙂
I've been running Qwen-Coder-Next 80B on my sort-of aging M1 Max MBP 64GB. A whole lot of playing around and a few bake-offs, with the intent of using via CLI and for vscode integration via Continue. I'm not 100% dialed in yet, but am running 64k context, and am pretty much at the bleeding edge - as someone else mentioned, I certainly had hit memory pressure red to the point of shutdown during some fine-tuning.

I'm running via locally configured and compiled llama.cpp as a launch service, serving local and for a few Linux dev VMs. Exact model: unsloth_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M.gguf
llama.cpp compiled with Metal support, etc.

Started down this path after being burned badly with Bolt (*super short* context it seems), and have worked in AI/ML for a handful of years (although not in LLM/generative space), so wanted to see where I could get to. This led to a long but useful rabbit hole of building my own agentic AI guardrail system, and reduced credit/token use by over 90% (no, it's not 'magic prompts' 😉 ), but also went looking at how far I could go locally.

Some interesting bits - I did compare vs Devstral/Mistral variants and a few others. QwenCoder-Next is an MOE model with selective hyperparam activation. The difference in response/tokens per second as well as time to response was a HUGE difference with Qwen3-Coder-Next coming out on top by quite a bit. The others simply weren't usable without screaming. Yes, I could have dropped down on the params and looked for a 30B (I only use 1.5B for code completion), but this is where I landed <right now> after a good amount of testing including code tests and reviews.

I probably could reduce my context size a bit, and increase max_tokens and prediction sizes.

In general, I use 'web ai' for planning discussions, finalize into a 'spec' I dump in the repo, then let either Codex or local model implement, or I implement part, then have it refer to the spec to complete and test. Sometimes I'll round robin and have web AI give me the local agent prompt and feedback ack/forth while jumping in if/when I see things going funny/wrong, etc.

Continue was kind of a huge pain to get sorted, so was vscode honestly. I'm a former sr sw engineer/architect, but moved into leadership roles some time back, so part is fun, but part is also me being a bit rusty, and I never preferred MS tools for well - anything at all, as a long-time Unix/Linux engineer. Did a few frustrating rounds of trying o get proper permissions enabled in Continue, while vscode config is literally *polluted* IMO with copilot settings, which was like a frigging virus - every time I thought I had banished it forever from my vscode, it found a way to pop back up like a virus. It didn't help with so MANY settings in vscode that sure sounded generic but were literally only for copilot, but got there, and will never touch/enable copilot in vscode again/it's banished forever for me. 😀

I also in general, refuse to shift to pay-per-token/API models at this point, as it's quite easy to see things jump pretty significantly. Of course, inevitable as someone said, pricing WILL go up, but if you're using paid models/plans, you do NOT need flagship Opus or GPT for 90% of most tasks. A buddy was burning around $1k/month, and a prior very senior sw engineering team I had managed was routinely running out of tokens. While some of what they're doing is pretty complex work, 90% of it could be done on Haiku, 5.4-mini or similar, and the price multipliers are HUGE on the flagship models. Right now, I use a paid GPT account for web AI, I keep a small paid plan on Bolt (always use your own project github, private is fine) for claude/haiku/sonnet/opus), and swap between codex (mostly 5.4-mini medium) and Qwen for coding tasks.

I know we're mostly talking about local, but ask your AI of choice or see if you can actually find a rate page for your brand model of choice - the difference in flagship vs 'pretty damned good' can be pretty crazy, even approaching 10x or more. Until OpenAI hosed me a bit (they seem to have dropped non-Spark 5.3-codex from the list in Codex app), I'd do 5.4-mini or Qwen, only occasionally bumping up to 5.3-codex. 5.4-mini is around 15% of GPT5, and still like 25% of 5.4 non-mini, while Codex is a bit less that 1/2 of the flagship, and slightly less than 5.4 non-mini. (not spark - they don't even tell you what it costs, which is hokey). So yeah, if you're going to use paid, consider which specific model to use when and your bank account and sanity will thank you.

Qwen3-coder-next generally performed at least as well as 5.4-mini and approaching 5.3-codex, occasionally but rarely slightly better, most of the times not-quite-as good. Single file/simple edits/patterns zero issues, but when you get into more complex tasks as in repo-wide refactors, this is where you want a higher-powered model or a detailed execution plan, possibly worked out with a web ai model with stronger planning/reasoning vs coding-focused agentic models.

So yeah, I now use Qwen and Codex almost interchangeably in practice, was doing Qwen for 90% of the work, but as I put my agent framework/guardrails in place and watched my credit usage plummet, it's a flip of the coin as to which I use. I probably need one more round of tuning on Qwen, but it's not locking up the system any more (yay 😉 ), although Codex remains faster vs the M1 Max.

Next step most likely is an M5 Studio 256GB if/when they announce it. I'll offload model use there, and keep the M1 Max MBP for a while longer as my travel system, and see what the 'mbp ultra' actually turns out to be. The M1 Max MBP is still an overall competent system - right up until I wanted to push the bleeding edge on RAM use with it. 😀
 
  • Like
Reactions: novagamer
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.