
jeroenvip

macrumors regular
Original poster
AI is undeniably something we can’t move away from when it comes to coding. Currently, I’m using Claude Max, which costs $100 per month. I’m a heavy user and hit the usage limits quite often. Today, I saw this on my X feed:


I’m considering replacing Claude Code with a local LLM and using that as my primary coding agent. I also believe that the future of software development will increasingly involve local LLM usage over the next couple of years.

Here’s the catch: to run a model like this locally, you realistically need at least 64 GB of RAM. On top of that, your IDE and all your other tools still need resources.

So the question is: does it make sense to invest in an M5 Pro or M5 Max (when they’re released) with 64 GB of RAM, with the idea of using it for the next five years and effectively offsetting the ongoing cost of a Claude Max subscription?
 
Playing around with LM Studio just recently, I'm asking myself the same question. Apple's product lineup really falls short here. I really hope the M5 Pro will have the option to go beyond 48GB, because what I need is more RAM, not more cores. I can't justify paying 1.5k bucks more for an M5 Max just to get 64GB or more of RAM.
 
I am considering something similar. That is, I'm thinking I might replace my M4 Pro Mac mini with an M5 Ultra Mac Studio, and then keep it for my typical 5-10 years. However, I don't expect to make that choice/purchase until early next year.

The M3 Ultra’s base memory is 96GB. Will that be upped with the M5 Ultra?
Is that enough?
Is it worth it?
Still a lot of ¯\_(ツ)_/¯

EDIT: Or an M5 Max with 128GB, as it appears the extra RAM capacity (especially since only about three-quarters of it can be allocated to the GPU) will help more than a bump in cores. The Max seems to have acceptable performance even for larger (>30B) models.
 
Considering the exorbitant prices Apple charges for lots of RAM, calculate how many months of a subscription service you could pay for with the money you'd spend on running a local model.
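
To make that concrete, here's a minimal back-of-the-envelope sketch (both prices are assumptions for illustration, not actual quotes):

```python
# Break-even point: how many months of subscription equal the hardware premium.
# Both figures below are assumptions for illustration.
subscription_per_month = 100   # e.g. Claude Max, per the OP
ram_upgrade_premium = 1200     # assumed extra cost of a high-RAM configuration

months = ram_upgrade_premium / subscription_per_month
print(f"The hardware premium equals {months:.0f} months of subscription")  # 12
```

On these numbers the RAM upgrade pays for itself in about a year; plug in real configurator prices to get your own break-even.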
 
There are other reasons than cost savings to run local LLMs... data protection is one of them. There is data that I don't want (or am not allowed) to feed to a public LLM.
 
If you want to run local LLMs, you should wait for the M5 Max or Ultra. There are massive performance improvements in the M5 series due to the neural accelerators. They will be 2x-6x faster for AI, depending on workload. Hopefully Apple will offer some pro-level M5 chips soon!

I'd say that 64GB of RAM is the absolute minimum for running Qwen3-Coder-Next. Memory for other apps will be tight once you load the model at q4 and fill up a good chunk of the context.
 
When I've had similar quandaries, I've often realized later that my view was right, just early. If you're thinking about this question in anticipation of the launch, perhaps take a breath and ask yourself what meaningful metric would give you the confidence to invest in your next piece of hardware for as long as you anticipate needing it. For example: would a tokens-per-second threshold make the local development approach viable? Is that something benchmarks will likely reveal in the weeks after launch? I think others are right to note that memory is likely a key variable. The bottom line is that your hypothesis is hard to test now, but it won't be for long.
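
On the tokens-per-second question, there's a useful back-of-the-envelope ceiling: decoding is typically memory-bandwidth-bound, since every generated token has to stream the active weights through memory once. A minimal sketch (the bandwidth figure and model shape are assumptions, not measured numbers):

```python
# Upper bound on decode speed for a memory-bandwidth-bound model:
# each new token must read all active weights from memory once.
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float, bits: int) -> float:
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed 546 GB/s of memory bandwidth, a 3B-active MoE at 4-bit quantisation:
print(f"~{decode_ceiling_tok_s(546, 3, 4):.0f} tok/s ceiling")  # ~364; real-world is lower
```

Prompt processing is compute-bound and behaves differently, but for generation speed this gives a quick sanity check once the M5 Pro/Max bandwidth numbers are public.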
 
I tried different models on my 32GB machine. Models that fit in 64GB will be slightly better, but none compare to having Claude Opus. Anthropic's infra and tech beat all the local LLMs on any complex task!

So weigh your expectations against the cost of maxing out a MacBook Pro's RAM.
 
We have a decent GitHub Copilot subscription at work, which is a great help with coding (and other) tasks.

At home, I have a decent Mac Mini M4 Pro (14-core CPU / 20-core GPU / 64GB RAM), and playing with local LLMs via LM Studio, it's quite astonishing what's already possible locally. Privacy and unlimited tokens are obviously the main factors here. But I consider myself a very lightweight AI user for personal usage, so I haven't yet hit a limit on free cloud plans.

In LM Studio, the newly available Qwen3 Coder Next 80B model popped up, so I downloaded it yesterday. With any 80B model, be it Qwen3 Coder Next or Qwen3 Next 80B, my 64GB Mac Mini M4 Pro goes into yellow memory pressure VERY quickly. According to Activity Monitor, the 80B model / LM Studio alone sits at 42GB+ of RAM usage, I guess depending on how many tokens you've pushed through. That leaves close to zero headroom for the rest of a typical dev setup: an IDE, a Docker container or two, and so on. IMHO, 64GB of RAM is the absolute minimum for a local 80B model, before counting other memory-intensive processes, so you may want 96GB+.

But local LLMs and the surrounding tools are fun, and you can get quite a lot out of them, e.g. even image generation. Although LM Studio doesn't support image generation out of the box, you can have it write decent prompts, which you can then feed to DiffusionBee on a Mac. A recent LM Studio version even supports acting as the runtime / model host for a local Claude Code installation, so you can point Claude Code at a local model served by LM Studio. Haven't tried that yet, though.
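
For anyone who hasn't used LM Studio's server mode: once it's running, any OpenAI-compatible client can talk to it on localhost, which is the mechanism such integrations build on. A minimal sketch (the model identifier is an assumption; use whatever name LM Studio lists for your download):

```python
# Minimal client for LM Studio's local server (OpenAI-compatible API,
# served on port 1234 by default). Nothing leaves the machine; the API key
# is just a placeholder the client library requires.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-coder-next-80b",  # assumed identifier; check LM Studio's model list
    messages=[{"role": "user", "content": "Write a DiffusionBee prompt for a misty harbor at dawn."}],
)
print(resp.choices[0].message.content)
```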

But seeing what's possible with GitHub Copilot at work, and at home with some variant of MS Copilot (the editions are confusing, IMHO) included in our M365 Family subscription, which my kids use for various things, I'm torn between local models and a cloud offering/subscription for personal usage. Especially after installing the Claude Desktop app on my Mac yesterday, on the Free plan so far, which gives me access to Sonnet 4.5. It already gave me great input while looking for student housing for my kid for autumn 2026, though MS Copilot would likely have done the same.

Claude doesn't support image generation, though, so for that it's back to MS Copilot in the M365 Family subscription, or the local LM Studio + DiffusionBee combo.

On Claude privacy, I'm not sure. There's an opt-out by disabling the "Help improve Claude" toggle, but I can't say what still happens when I upload personal documents in a chat... something similar is available for MS Copilot as well.

I'm torn between local and cloud, but I'm a very lightweight user for personal stuff at home, so maybe a free plan of something is good enough, unlike others who hit limits even on Claude's Max plan 😎. For the price of an Apple 64GB-to-128GB RAM upgrade alone, you get quite a few months of subscription somewhere, especially at moderate usage rather than a maxed-out plan. There's also the question of how up-to-date the data in a local LLM is, if you need to work on very recent stuff.

One big bonus of a cloud plan for me would be having my chats available everywhere: on the go, on client devices like an iPad, etc. But as I said, I'm torn between local and cloud; local also because I want to make use of the 64GB RAM investment in my Mac Mini M4 Pro.
 
On my 96GB M3 Max I can run 14B Devstral (MLX, in LM Studio) with 128k context. I tried Mistral's vibe CLI for agentic coding: it barely managed to rewrite a simple C codebase (~300 LoC including headers and a small CMake file), and it took almost 10 minutes, when something like Gemini would do it in maybe a fifth of the time, if not quicker, not to mention with 10 times the context. There are obvious benefits to running things locally and I'm all for it, but just letting you know to temper your expectations if you're used to cloud AI.

Inference in general is pretty slow on Macs. It's mostly useful if you're just chatting or generating images, where you don't mind waiting between replies.
 
On this hardware you should instead run Qwen3 Coder Next, which is way smarter and faster than Devstral (an MoE with 3B active parameters versus a 14B dense model); you'll see much better results.
 
I second Qwen3 Coder Next; this is the first model that seriously challenges the big online beasts, IMO. It's also a decent general model.

In a few years' time it'll hopefully seem odd that we ever used the likes of Claude Code over local inference. We're not there yet, but the gap seems to close every few months.
 
I am all in on replacing cloud LLMs with self-hosted ones. They are private and uncensored. But unfortunately, neither open-weight models nor consumer hardware are quite there yet for it to be viable.

Qwen3-Coder-Next is 80B parameters, so at 16-bit it needs 159GB of RAM just to load the weights, hence 256GB for minimum usage (context + KV cache need memory too).

https://huggingface.co/Qwen/Qwen3-Coder-Next/tree/main

On a 64GB machine you could only run a 4-bit quantised version, which is suboptimal for coding. Maybe an 8-bit version on a 128GB machine will be acceptable.
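
The weight-memory arithmetic behind those numbers is straightforward; a minimal sketch (real GGUF/MLX quants store per-block scales on top, so actual files run somewhat larger):

```python
# Approximate weight memory for an 80B-parameter model at common quant levels.
params = 80e9  # parameter count

for bits in (16, 8, 6, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gb:.0f} GB")
# 16-bit: ~160 GB, 8-bit: ~80 GB, 6-bit: ~60 GB, 4-bit: ~40 GB
```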

Anyway, even the full version is not as good as the top open-weight models like Kimi K2.5 and GLM5. For the former you need ~1TB of RAM, for the latter 2TB. So the only option is to run them on a cluster of 512GB Mac Studios (e.g. using exo, https://exolabs.net). And those models are still not as good as Codex 5.3 / Opus 4.6. It's very impressive that this is now possible on technically consumer hardware, but paying $20-40k for a cluster is not financially viable compared to $100/month for Claude Code. Not to mention that with Claude Code you always have access to the latest model, while the local cluster might need updating in a year or two to run newer local models.

I suggest you try Qwen3-Coder-Next on your actual tasks and evaluate whether it can replace Claude for you. If it can, look at a machine with at least 128GB. Try the other top open models too and see if any of them fits your needs better; maybe a single 512GB Mac Studio is reasonable for you.

Depending on which way model progress goes, local LLMs might become viable soon. If new models get smaller while staying as capable as recent top proprietary models, we might be able to run them on a single Mac Studio, or even a new MBP with 256GB (which will hopefully come soon).

However, progress might go the other way, with top open models getting bigger and bigger. The latest GLM5 doubling its parameter count is an indication of this trend. In that case we can only hope that hardware catches up faster, or that other optimisations appear.

Until then, cloud LLMs are unfortunately the more reasonable option unless you value privacy above all else.
 
LM Studio tells me I can run up to a 6-bit quant of the 80B Qwen Coder. Not sure I'd have enough RAM for 128k context. New to local LLMs; is there a rule of thumb for how much RAM/VRAM you need for a given context window?
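
Rough rule of thumb: the KV cache grows linearly with context length, at about 2 × layers × KV heads × head dim × bytes per element for every token. A minimal sketch (the architecture numbers below are illustrative assumptions, not the actual Qwen3 config):

```python
# Back-of-the-envelope KV-cache sizing for a transformer with grouped-query
# attention. The architecture numbers used below are assumptions.
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    # Factor of 2: both keys AND values are cached, per layer, per token.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

# e.g. 48 layers, 8 KV heads of dim 128, fp16 cache, full 128k context:
print(f"~{kv_cache_gib(48, 8, 128, 128_000):.1f} GiB")  # ~23.4 GiB on top of the weights
```

So a long context can add tens of GB on top of the weights, which is why a quant that "fits" may still not leave room for the full 128k window.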
 
Tried the 4-bit quant MLX through LM Studio. Pleased with the speed, but it struggled quite a bit with a simple C++ project + Qt6. Adding a button that does something using existing code took it about an hour, since I had to keep clearing the session once it got stuck looping on compilation errors. I'm sure it would one-shot a Vue3 / React web portfolio, though. I should do that next, since I've been meaning to for a while now.
 
I've got an M5 Pro now with 64GB of RAM (the 18-CPU/20-GPU-core chip), and I'm running a local setup: LM Studio (in server mode) serving the qwen3.5-35b 8-bit model (MLX version), plus AnythingLLM with a RAG built on a LanceDB vector database. It delivers pretty good results (maybe not as good as the current frontier models), but definitely usable for discussing software architecture and similar topics over a longer period, with documents embedded in the RAG.

It consumes roughly 40GB of memory, and while prompts are processed the GPU is at 100%, but it's usable and I can still work in IntelliJ, keep some Docker containers running, and so on at the same time. The fans kick in pretty loudly, though 😅

Edit: I don't use it for coding; I have a Codex license for that. But I'm not allowed to discuss client information and data with Codex, so that's where the local LLM comes into play.
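
Stripped to its bones, that retrieval setup looks roughly like the sketch below, assuming an embedding model is loaded in LM Studio alongside the chat model (both model identifiers are placeholders; in practice AnythingLLM handles all of this plumbing):

```python
# Toy RAG loop: LanceDB for vector retrieval, LM Studio's local server
# (OpenAI-compatible) for embeddings and generation. Model names are assumed.
import lancedb
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def embed(text: str) -> list[float]:
    return llm.embeddings.create(model="local-embedding-model", input=text).data[0].embedding

db = lancedb.connect("./rag-db")
docs = ["Service A talks to service B over gRPC.", "Deployments run through Jenkins."]
table = db.create_table("docs", data=[{"vector": embed(d), "text": d} for d in docs],
                        mode="overwrite")

question = "How do the services communicate?"
hits = table.search(embed(question)).limit(2).to_list()
context = "\n".join(h["text"] for h in hits)

answer = llm.chat.completions.create(
    model="qwen3.5-35b",  # the chat model named in the post above
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```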

 
Can anyone say anything about performance gains in the M5 series for ComfyUI image generation? It's been absolute garbage on my M2 Ultra Mac Studio, way too slow compared to a decent PC. It would be great if the new accelerators and such helped with that 🙂
 
The M5 series shows very good performance gains for image generation. From Apple:
  • Up to 7.8x faster AI image generation performance when compared to MacBook Pro with M1 Pro, and up to 3.7x faster than MacBook Pro with M4 Pro.

 