
I have a MacBook Pro M4 Max that is "maxed" (get it? lol): 128GB RAM and 8TB storage, fully loaded. I am interested in running some of the advanced LLMs (DeepSeek comes to mind) and I wonder what experience others have here. I am a "newbie" to running LLMs locally, but I am a regular user of ChatGPT (and I've found that useful).

One thing I've noted is that, thus far, there doesn't appear to be a way to expand the system with an attached array of GPUs, something the Intel Macs were able to do. I imagine Apple is well aware of this limitation, but I'd like to understand more about it, and whether anyone has heard of plans to allow attached GPU arrays to augment performance.

I think DeepSeek is disruptive enough to give hope that, moving forward, we won't need to spend tons of money on this.
 
I use an M1 Max with 64GB. You can run decent-sized LLMs, but note that they come in a variety of "quants". This means the original weights have been compressed with a lossy compression scheme: Q4 means 4 bits are used to encode what was fp16 (or fp32), i.e. 16- or 32-bit floating point, along with an appropriate per-block minimum and scale; Q2 means 2 bits, and so on. With lossy encoding, accuracy decreases and hallucinations increase as fewer bits are used.
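To get a feel for why the quant level matters for fitting a model in memory, here is a rough back-of-the-envelope sketch in Python (my own illustration; the effective bits-per-weight figures and the overhead factor are approximations, not exact values for any particular GGUF format):

    # Rough size estimate for a quantized model; approximation only.
    # Effective bits-per-weight include block scales/minimums and vary by format.
    def model_size_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.05) -> float:
        raw_gb = params_billion * bits_per_weight / 8.0  # billions of params * bytes per weight = GB
        return raw_gb * (1.0 + overhead)                 # small fudge factor; KV cache not included

    for label, bits in [("Q2_K", 2.6), ("Q4_K", 4.6), ("Q6_K", 6.6), ("Q8_0", 8.5), ("fp16", 16.0)]:
        print(f"70B at {label}: ~{model_size_gb(70, bits):.0f} GB")

On a 64GB machine that puts a Q4_K 70B within reach but pushes Q6/Q8 out of GPU-allocatable memory, which matches what I see below.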

Try ollama, which is a wrapper for llama.cpp. Lots of models are available for free, so you can download the largest .gguf model that fits in your GPU-allocated memory. My 64GB M1 Max can run a Q4_K 70B (parameter) 43GB model no problem, with reasonable tokens/sec generation, but a larger quant, e.g. Q6 or Q8, cannot fit in GPU-allocated memory, so it is very slow at generating output. Your 128GB machine can probably run a Q6 or Q8 in GPU memory and get higher-quality results. Your machine is probably not big enough to run the fp16 model.
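Once ollama is installed the workflow is just `ollama pull <model>` and then prompting it; it also runs a local REST server on port 11434 that you can hit from a script. A minimal sketch, assuming the server is running and you've pulled a model tagged llama3.3:70b (substitute whatever tag you actually download):

    # Minimal sketch: prompt a local ollama server over its REST API.
    # Assumes `ollama serve` is running and the model tag below has been pulled.
    import json
    import urllib.request

    payload = {
        "model": "llama3.3:70b",   # example tag; use the model you pulled
        "prompt": "Explain GGUF quantization in two sentences.",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])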

EXO can help you run larger models simply by adding more machines. The larger model is split by layers across the machines, which is why you see people running a stack of Mac Minis. Communication is a bottleneck, as results have to be passed on to the next machine running the next layers, so some people use Thunderbolt 5 networking. With that, the fp16 model can probably be run for the highest-quality results.
 
Thank you for responding. I had not heard of EXO before. I have at least one MacBook Pro that is Intel-based, to which I could technically connect an external GPU "box" (I forget the term). But I am still wondering about the limitation of connecting something like this to Apple Silicon - I read an explanation a while ago that pointed at a specific design strategy.
 

It's physically possible to connect an external GPU box to an Apple Silicon-based Mac Mini via Thunderbolt 3/4/5, but it wouldn't be visible to the software and therefore wouldn't be used. One would have to write an entirely separate software stack, from libraries down to kernel drivers, to access it; the kernel drivers would likely have to be signed by Apple, and Apple likely wouldn't sign them, as it's not aligned with their roadmap for the platform.

Apple's foreseeable vision for the Mac platform is an all-Apple stack of software and hardware, right down to the RAM, which is integrated onto the Apple Silicon package. That RAM is a shared resource across the CPU, GPU, and NPU, among other units in the package. This has performance advantages in some situations, but it optimizes the system for that design rather than for working with an external or discrete GPU. Apple would probably argue that the niche that needs an external/dGPU is already better served by Linux anyway.

On the plus side, Apple's approach is very cost-effective for running large-memory LLMs locally. A Max with 128GB of RAM can allocate over 90GB of it to the GPU, and an Nvidia GPU with that much VRAM costs a lot more than an entire Max. Generally an Nvidia-based platform is faster when working within a given amount of VRAM, but it's also much more expensive for a given amount of VRAM beyond 8 or 16GB.
 
It could also be that Apple is developing a solution to bring to market in the near future, as GPU arrays are useful for a number of applications (scientific computing, etc.). That would seem in line with their direction, including their own 5G modems, etc.
 
I hope you’ll continue updating this thread as you explore LLMs.

I started experimenting with LLMs almost by accident. I didn’t buy my MacBook Pro (unbinned M4 Pro, 24GB/1TB) with LLMs in mind and only started playing about with them after watching some YouTube videos on the subject.

I’ve tried pushing my system to uncomfortable levels (27B gemma2) and have now settled on a more usable 9B gemma2 with docs added to Ollama’s Knowledge section. It’s been an interesting exercise to brainstorm novel-writing with the LLM.

I’ve also been comparing what ChatGPT brings to the table as something that’s much, much larger (ChatGPT is generally better than my 9B local LLM, but it’s not orders of magnitude better). I’ve also experimented with running a smaller LLM on my old Late-2014 Mac Mini, and also on my old Surface Pro 6. It’s been good to compare.

At the moment, I’m putting some serious thought into getting a more powerful machine. I’m looking ahead to a potential M4 Max Studio when it’s released - with something like 96GB RAM (or even more). I need to be sure I can justify the hit on my finances before I do so, so your experience using an M4 Max MacBook Pro with 128GB RAM is something I’m looking forward to reading about.
 
download lmstudio and play

lmstudio.ai

on your machine, maybe try Llama 3.3 (70B parameters) as a model that will fit comfortably and generate decent results. It will need around 30-40 GB of RAM.
 
You can connect multiple Macs to pool the memory. I know someone who has a cluster of 4 M2 Ultras with 192GB x 4 (768GB of unified memory). You can connect them with Ethernet (slower) or Thunderbolt (slightly better). It also depends on the model you load; some may need 500-700 GB of RAM in total but may only actively need about 50-60 GB at a time.
 
download lmstudio and play

on your machine, maybe try Llama 3.3 (70B parameters) as a model that will fit comfortably and generate decent results. It will need around 30-40 GB of RAM.
I have both Ollama and LM Studio. With the same model at the same level of quantization (Llama 3.3 Q4_K, I think it was), it will run in Ollama but refuse to run in LM Studio because of RAM limits, despite messing with sysctl iogpu. It seems Ollama is more efficient in terms of GPU RAM.
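For anyone else poking at this: the knob in question is the iogpu wired-memory limit. A small sketch for reading it from Python on Apple Silicon macOS; my understanding is that 0 means the stock default is in effect, and the numbers in the comment are only examples:

    # Sketch: inspect the Apple Silicon GPU wired-memory limit via sysctl.
    # A value of 0 means the system default limit is in effect (my understanding).
    import subprocess

    def gpu_wired_limit_mb() -> int:
        out = subprocess.run(
            ["sysctl", "-n", "iogpu.wired_limit_mb"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())

    print("iogpu.wired_limit_mb =", gpu_wired_limit_mb())
    # Raising it needs root and lasts until reboot, e.g.:
    #   sudo sysctl iogpu.wired_limit_mb=100000   # ~100 GB on a 128 GB machine; at your own risk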
 
I have been running ollama on a Dell workstation with 128G of RAM and a decent Nvidia RTX A5000 GPU with 24G of VRAM for over a year. I also installed Open WebUI for the chat interface. It's a great combination and it has indeed helped my workflow.

But with only 24G VRAM, it limited the models I could run. For example, Gemma2 27b works fine, but llama-3.3 doesn't even load (ollama tries to load it into CPU RAM and it takes forever).

Last week, I purchased a new MBP 14 with 128G of RAM and finally I could run larger models. I installed ollama on the host and Open WebUI in Docker, so that I could keep my old workflow.

As for performance, with llama-3.3, which has 70B parameters, inference speed was about 10 to 12 tokens/s. It's usable. Anything below that would be too slow.

But when you run a large model for the first time, it would take some time to load.
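If anyone wants to reproduce these tokens/s figures, ollama's /api/generate response includes eval_count and eval_duration (in nanoseconds), so the rate is just one division. A quick sketch; the model tag is only an example:

    # Sketch: compute generation speed from ollama's response metadata.
    import json
    import urllib.request

    payload = {"model": "llama3.3:70b", "prompt": "Tell me a joke.", "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())

    tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)  # eval_duration is in ns
    print(f"{tokens_per_sec:.1f} tokens/s")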



For deepseek-r1, the 70b model was slower, barely touching the 10 token/s mark.



Because of the reasoning process, it's usually much slower to give answers. For the simple `tell me a joke` test, there is no thinking process, so it's relatively faster.

If you ask it a more complicated question, such as a math question, it would struggle more.

For memory usage, during inference, the 70b models could use about 40G-50G memory. After the completion, the memory would be released. Ollama seems very good at managing the usage on-demand.

[Screenshots: memory usage during and after inference]


70B models are pretty much the highest you can go, and that's with 4-bit quantization. With 8-bit or higher precision, the memory usage would go much higher and performance much lower.

As for potential uses of a local LLM, you can develop your own apps, or deploy third-party ones as long as they can be configured to point at a local LLM.
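On that point, ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so a lot of third-party tools (and the stock openai Python client) can be pointed at the local server just by overriding the base URL; the API key is ignored locally. A sketch, with the model tag as an example:

    # Sketch: use the standard OpenAI client against a local ollama server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is not checked locally
    reply = client.chat.completions.create(
        model="llama3.3:70b",  # whatever tag you have pulled
        messages=[{"role": "user", "content": "Give me three practical uses for a local LLM."}],
    )
    print(reply.choices[0].message.content)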
 

But with only 24G VRAM, it limited the models I could run. For example, Gemma2 27b works fine, but llama-3.3 doesn't even load (ollama tries to load it into CPU RAM and it takes forever).
I tried running gemma2:27b on my MacBook Pro (unbinned M4, 24GB/1TB) via Ollama and it was too slow to be usable. The response_tokens/s was just 1.88 (although I believe I’ve got as high as 10t/s - so I’m guessing it depends what else my laptop is doing, and maybe what I’m asking it). If I run a model based on gemma2:latest (9.2B), I get 28.57 t/s - which is what I use most of the time, and I’m able to run that while also running AI image generation (but I am pushing it when doing both).

My hope for the future is an M4 Max Studio with as much RAM as my budget will stretch to. I’m intrigued by what else I can do with local LLMs, and I’m not sure I’ll be able to progress very far with just my MBP. For this reason, I’m closely following anyone who, like you, has this kind of spec already.
 
Speed requirement does depend on what you’re doing. Sure, it might be slow but if you can generate code that would take you a bunch of time to get right in an unfamiliar language for example, maybe still worth it.
 
My hope for the future is an M4 Max Studio with as much RAM as my budget will stretch to. I’m intrigued by what else I can do with local LLMs, and I’m not sure I’ll be able to progress very far with just my MBP.
I guess we will see 256GB or higher with the M4 Ultra that is coming with lots of GPU cores. Maybe that'll be $7,000.
 
Speed requirement does depend on what you’re doing. Sure, it might be slow but if you can generate code that would take you a bunch of time to get right in an unfamiliar language for example, maybe still worth it.
I've been trying larger models, with nothing of significance running (just AeonTimeline, Markdown One, and my browsers). gemma2:27b has actually been quite usable, so that's a lesson I've learned.

I have also been playing about with context length. While the 9B (and even a 13B) model is okay with a longer context length, gemma2:27b had a big wobbly. I got "500" errors and no response in Ollama. I've brought the context length back down (but still higher than the default), and it causes the model to take quite some time to respond, the response rate is slow, and it's the first time I've seen all of the performance cores on my MBP leap to maximum. The memory also jumps into the red each time it tries to return a token.

Well, I didn't really expect much else - but at least I know what the limits of my MBP are.
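For reference, the context-length knob being adjusted above is ollama's num_ctx option, and the KV cache it implies is where much of the extra memory pressure comes from. It can be set per request; a sketch with example values (the model tag and prompt are placeholders):

    # Sketch: request a larger context window from ollama via the num_ctx option.
    import json
    import urllib.request

    payload = {
        "model": "gemma2:27b",            # example tag
        "prompt": "Summarize the following chapter: ...",
        "stream": False,
        "options": {"num_ctx": 8192},     # example value; larger contexts cost a lot more memory
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])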
 
You can connect multiple Macs to pool the memory. I know someone who has a cluster of 4 M2 Ultras with 192GB x 4 (768GB of unified memory). You can connect them with Ethernet (slower) or Thunderbolt (slightly better). It also depends on the model you load; some may need 500-700 GB of RAM in total but may only actively need about 50-60 GB at a time.
Can you explain in more detail the approach to pooling memory that you know has worked?
 
The idea is simple: you have a bunch of layers, so you put some of the layers on one machine and the other layers on another, in pipeline fashion. There is free software to do this at https://exolabs.net/. The disadvantage is that the results from one machine have to be sent to the next machine in the pipeline, so it's always slower than running on one machine, since it's serial. But you can run a larger model. However, if you have lots of queries, you can take advantage of pipeline parallelism to batch the jobs, so one machine is not idle waiting for results from the previous machine before it begins work.
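To make the pipeline idea concrete, here's a toy sketch (purely illustrative, nothing to do with EXO's actual implementation) of a model's layers split across two "machines", and why batching keeps both busy:

    # Toy sketch of pipeline parallelism: layers split across two hosts.
    from typing import Callable, List

    Layer = Callable[[float], float]

    def make_layers(n: int) -> List[Layer]:
        # Stand-ins for transformer layers; a real layer transforms activations.
        return [lambda x, i=i: x + i for i in range(n)]

    layers = make_layers(8)
    host_a, host_b = layers[:4], layers[4:]   # each "machine" holds half the model in its own RAM

    def run_host(stage: List[Layer], x: float) -> float:
        for layer in stage:
            x = layer(x)
        return x

    # A single query is strictly serial: host B must wait for host A's activations.
    print(run_host(host_b, run_host(host_a, 1.0)))

    # With a batch of queries, host A can start query N+1 while host B finishes query N,
    # which is the pipeline parallelism described above.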
 
It works fast with MoE (Mixture of Experts) models: the model as a whole is huge, but only a smaller set of expert sub-models is loaded and run for any given token (see the toy sketch at the end of this post). Apple Intelligence is a good example, though it fits on smaller devices. DeepSeek also leverages MoE models; it's basically a bunch of ~40 GB expert models inside a 405GB, 671B-parameter model. You can also try to batch different queries so that different expert models load on different machines in the cluster, or batch them for better performance.
The Apple MLX team is pretty up to date with running these models and works with others on optimizing performance. Follow this Apple MLX engineer on X.
I love their Swift MLX examples and code; I've used a lot of it with my iPad Pro test bed. Much of it is for the Mac, but modifying it for iPadOS or iOS isn't a big deal.
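To illustrate the "only a few experts run per token" point above, here's a toy top-k routing sketch (illustrative only; not how DeepSeek or Apple Intelligence is actually implemented):

    # Toy sketch of Mixture-of-Experts routing: only the top-k experts run per token.
    import random

    NUM_EXPERTS, TOP_K = 8, 2

    def expert(idx: int, x: float) -> float:
        return x * (idx + 1)   # stand-in for a full expert feed-forward block

    def route(x: float) -> float:
        gate = [random.random() for _ in range(NUM_EXPERTS)]   # stand-in for a learned gating network
        chosen = sorted(range(NUM_EXPERTS), key=lambda i: gate[i], reverse=True)[:TOP_K]
        total = sum(gate[i] for i in chosen)
        # Only the chosen experts compute; the rest of the (huge) model stays idle for this token.
        return sum(gate[i] / total * expert(i, x) for i in chosen)

    print(route(1.0))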
 
Genuine question - what exactly are you running your LLM for? I mean, what's your use case for it?
I find the topic interesting, but struggle to see what I would use an LLM for if I were ever to set one up...
 
I started looking at LLMs because I knew nothing about them. When I discovered my MBP could run them, I was eager to learn.

To start with, I ran a few chapters of old stories I’d written through different LLMs to see if they could offer new perspectives.

Then I decided to write a new story, using the LLM to offer perspectives, suggestions, and research for things I didn’t know.

It’s all just part of learning about them. I’m just fascinated by it all.
 
Wow, my M4 Mac Mini Pro, 64GB/1TB, runs the Qwen2.5-Coder-32B-Instruct model in LM Studio with no problems.
So easy.
I just turn up the fan speed to not let it get thermally saturated.
This model uses GPU Cluster blocks to the max at times, but otherwise okay.
Cool.
 
Thanks to its large unified memory, the M4 Max runs circles around the RTX 4090 and even the RTX 6000 Ada on the 72B Qwen LLM.

[Screenshot: benchmark chart comparing M4 Max, RTX 4090, and RTX 6000 Ada on Qwen 72B]


 