That's out-of-date information. It can be adjusted using sysctl iogpu in Terminal. See the attached screenshot (attachment 2479686).

Just because you can adjust it doesn't mean the other info is out of date. If you don't adjust it, the previous information is accurate. In games like Shadow of the Tomb Raider, the VRAM of my M1 Max with 32GB RAM is shown as 21GB. throAU said they see 48GB of VRAM out of 64GB, so Apple's information is still valid unless you have a link to new information.
 
Just because you can adjust it doesn't mean the other info is out of date. If you don't adjust it, the previous information is accurate. In games like Shadow of the Tomb Raider, the VRAM of my M1 Max with 32GB RAM is shown as 21GB. throAU said they see 48GB of VRAM out of 64GB, so Apple's information is still valid unless you have a link to new information.
It's out of date because Mac LLM software allows you to use more than the default 75%. (Not talking about games in this thread.)
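If anyone wants to see what that looks like in practice, here is a small sketch (macOS only) that reads total RAM and prints the command commonly used to raise the limit. The key name iogpu.wired_limit_mb and the 90% target are assumptions on my part; check `sysctl iogpu` on your own machine first, and note that the setting does not survive a reboot.

```python
# Sketch: compute the ~75% default GPU memory ceiling discussed above and
# print the sysctl command used to raise it. The key iogpu.wired_limit_mb is
# an assumption based on recent macOS releases; verify with `sysctl iogpu`.
import subprocess

total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
total_mb = total_bytes // (1024 * 1024)

default_limit_mb = int(total_mb * 0.75)  # the default ceiling mentioned above
new_limit_mb = int(total_mb * 0.90)      # example: leave ~10% of RAM for macOS

print(f"Total RAM             : {total_mb} MB")
print(f"Approx. default limit : {default_limit_mb} MB")
print(f"To raise it, run      : sudo sysctl iogpu.wired_limit_mb={new_limit_mb}")
```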
 
I'm just wondering if the maximum memory capacity of the M4 Max could be doubled, considering AMD has achieved 128GB using four channels while the M4 Max has eight channels.
With 256GB, the full-size 671B DeepSeek R1 with 2.51-bit dynamic quantization could fit into it, with a great accuracy improvement over the 70B distilled version.
That would be great.
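A quick back-of-envelope for that, counting weight storage only (the bits-per-weight figures are approximate, and KV cache plus runtime overhead come on top):

```python
# Rough weight-only size: params * bits_per_weight / 8 bytes. Ignores KV
# cache, activations and overhead, so real memory needs are higher.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of params -> GB

print(f"671B @ 2.51-bit dynamic quant: ~{weights_gb(671, 2.51):.0f} GB")
print(f" 70B @ ~4.8-bit (Q4_K_M-ish) : ~{weights_gb(70, 4.8):.0f} GB")
print(f" 72B @ 16-bit (fp16/bf16)    : ~{weights_gb(72, 16):.0f} GB")
```

So roughly 211GB of weights would plausibly fit in 256GB of unified memory, with some room left over for the KV cache.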
 
FYI, the Qwen2.5-72B-Instruct model does not run well on an M4 Mac Mini 64GB/1TB.
It won't even load without the loading restrictions lessened.
Not worth the pain, nor the possible system crash/corruption.
Just sayin'...
 
FYI, the Qwen2.5-72B-Instruct model does not run well on an M4 Mac Mini 64GB/1TB.
It won't even load without the loading restrictions lessened.
Not worth the pain, nor the possible system crash/corruption.
Just sayin'...
You can try running the DeepSeek R1 distilled Llama 70B with Q4 quantization. It should run well.
 
Genuine question - what exactly are you running your LLM for? I mean, what's your use case for it?
I find the topic interesting, but I struggle to see what I would use an LLM for if I were ever to set one up...

I use one to generate analysis of a survey, using Python to generate the prompt. It does a decent job of highlighting the results in a simple two-paragraph summary, which I then display on a web page generated from Django. I would not use it to generate a standalone report for a client, but a summary at the end of the survey adds to the results displayed.
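For anyone curious what that kind of pipeline looks like, here's a minimal sketch assuming a local model served by Ollama on its default port; the model name and the survey fields are placeholders, not what I actually use:

```python
# Minimal sketch: build a prompt from survey results and ask a local Ollama
# model for a two-paragraph summary. Model name and survey data are placeholders.
import json
import urllib.request

survey_results = {
    "responses": 214,
    "satisfaction_avg": 4.1,
    "top_complaint": "slow support replies",
}

prompt = (
    "Write a two-paragraph summary of these survey results for a web page:\n"
    + json.dumps(survey_results, indent=2)
)

payload = json.dumps({
    "model": "llama3.1:8b",   # placeholder; use whatever model you have pulled
    "prompt": prompt,
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    summary = json.loads(resp.read())["response"]

print(summary)  # e.g. drop this into the Django template context
```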
 
You can try running the DeepSeek R1 distilled Llama 70B with Q4 quantization. It should run well.
Ahh, I was using the "full on" version of the Qwen2.5-72B-Instruct model, not one of the sub-quantizations.
 
Ahh, I was using the "full on" version of the Qwen2.5-72B-Instruct model, not one of the sub-quantizations.
I find the performance of an LLM degrades quickly with the amount of quantization despite the same number of parameters. Q2 and Q3 should be avoided IMHO. Q4 minimum. Q6 and above seem not so different from the fp16 or bfloat16 versions.
 
I guess I should say my 64GB/1TB M4 Mac Mini is the Pro version.
That makes a difference to the information I posted earlier.
 
I find the performance of an LLM degrades quickly with the amount of quantization despite the same number of parameters. Q2 and Q3 should be avoided IMHO. Q4 minimum. Q6 and above seem not so different from the fp16 or bfloat16 versions.
Good to know, thanks!
 
FYI...

I just tried DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf on a 64GB M4 Mac Mini Pro, and it seemed slow.
 
FYI...

I just tried DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf on a 64GB M4 Mac Mini Pro, and it seemed slow.
It is. And context length is restricted as well: longer context lengths require tons of RAM. DeepSeek R1 supports 163K if I remember correctly, which would require an additional ~800GB of RAM; what we have running locally is probably restricted to something like 7K. These LLMs require a lot of RAM and compute power.
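For a rough feel of why, here's a back-of-envelope KV-cache estimate. The numbers are illustrative for a Llama-70B-style model with grouped-query attention (80 layers, 8 KV heads, head dim 128, fp16 cache); DeepSeek R1's MLA attention and quantized KV caches change the figures a lot, so treat this as a sketch, not a spec:

```python
# Back-of-envelope KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_value. Illustrative Llama-70B-style numbers.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"context {ctx:>7,}: ~{kv_cache_gb(80, 8, 128, ctx):.0f} GB of KV cache")
```

And that's on top of the weights themselves, which is why a smaller machine falls over quickly when you raise the context.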
 
It is. And context length is restricted as well: longer context lengths require tons of RAM. DeepSeek R1 supports 163K if I remember correctly, which would require an additional ~800GB of RAM; what we have running locally is probably restricted to something like 7K. These LLMs require a lot of RAM and compute power.
I’m interested in learning more about context length.

I’m using an LLM to help brainstorm my novels, and the main issue I have is the short context length of LLMs. It makes them almost useless for my purposes because the LLM quickly forgets early parts of the discussion - so it ends up suggesting and recommending things that have already been discussed and written.

I don’t have a huge amount of RAM right now (just 24GB) and, when I tried increasing context length too much on a 10B model, my MBP quickly went unstable and crashed/rebooted.

I’m hoping that if I can go for a Studio with a good chunk of RAM in the future, I’ll be able to use a semi-decent size LLM with a good size context length. I’m seeking additional info on context length and how much RAM is required.
 
I have the high-end M4 Pro mini, maxed out except for the storage, which is 2TB. The best models I run comfortably are the Qwen2.5 versions in 32b size. While the inference speed at 11 tokens per second isn’t bad, I usually opt for the 14b versions because they’re twice as fast and good enough for most tasks. If I need better results, I switch to the 32b models or use something larger via OpenRouter (DeepSeek V3 is great as a last resort and ridiculously cheap).

I rely on Ollama for things like text improvement, summarization, translation, code refactoring, and autocompletion. For the majority of my tasks, I use the excellent Bolt AI app, which I absolutely love because it lets me trigger a contextual menu with keyboard shortcuts to quickly access many useful prompts that can also be customized. Who else is using this app?
 
I'm just wondering if the maximum memory capacity of the M4 Max could be doubled, considering AMD has achieved 128GB using four channels while the M4 Max has eight channels.
With 256GB, the full-size 671B DeepSeek R1 with 2.51-bit dynamic quantization could fit into it, with a great accuracy improvement over the 70B distilled version.
That would be great.
I believe the Mac Pro will not have an M4 and will probably provide a lot of the RAM required, but I’d expect it to cost between $10,000 and $15,000.

There’s a reasonable chance the Studio will get 256GB, a very unlikely chance it’ll get 384GB, and I doubt they go higher than either of those on the M4 Ultra.

I believe memory capacity will be the gating factor to upsell people into the Mac Pro / Hidra platform, and it probably will work, assuming they improve matrix operations on the GPU and get the inference speed up a bit. I think they can sell 10x the amount they have been selling if they play their cards right.

Or, if you’re OK with linking multiple computers, MLX already supports distribution, and three M2 Ultras will run the full R1 model at 4-bit today for about $20k.
 
There’s a reasonable chance the Studio will get 256GB, a very unlikely chance it’ll get 384GB, and I doubt they go higher than either of those on the M4 Ultra. I believe memory will be the gating factor to upsell people into the Mac Pro / Hidra platform, and it probably will work, assuming they improve matrix operations on the GPU and get the inference speed up a bit.
I think the M4 Ultra Studio will support more than the 192GB of the M2 generation.
256GB is likely, but at what price? Still, two Studios (= 512GB) may be cheaper than whatever big-iron platform is coming down the pike. An LLM can be split à la EXO, with some layers on one Studio and the rest on another, communicating over Thunderbolt 5 networking.
 
I think the M4 Ultra Studio will support more than the 192GB of the M2 generation.
256GB is likely, but at what price? Still, two Studios (= 512GB) may be cheaper than whatever big-iron platform is coming down the pike. An LLM can be split à la EXO, with some layers on one Studio and the rest on another, communicating over Thunderbolt 5 networking.
Yep, see this post by an Apple ML researcher doing exactly that: https://t.co/RETLvpHdVr

If you want one computer, you’re going to pay through the nose, but it is a lot more convenient and less hassle. The workloads aren’t too difficult to distribute, though, so it will be a difficult decision for some buyers.
 
If you want one computer, you’re going to pay through the nose, but it is a lot more convenient and less hassle. The workloads aren’t too difficult to distribute, though, so it will be a difficult decision for some buyers.
Pricing-wise, there has always been a divide between the consumer and workstation/enterprise markets. If Apple doesn't charge too much for the 256GB RAM configuration, e.g. by letting you get it with the minimum SSD, then buying one in 2025 and adding another as funds become available may be a very good strategy. For those just issuing company purchase orders, by all means go straight for the big yet-to-be-revealed 512GB or 1TB machine.
 
Does anyone use koboldcpp? I wanted to run DeepSeek R1 with speculative decoding (say, a 7B & 32B setup), but found I could not get the two models running in parallel. Staring at the logs, I saw something about Metal not supporting asynchronous calls. As such, it was slower than just running the 32B alone.
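For anyone wondering what speculative decoding actually does with the two models, here's a toy greedy-acceptance sketch. The draft_next and target_next functions are hypothetical stand-ins for the 7B and 32B models; real implementations verify all the draft tokens in one batched forward pass of the big model, which is where the speed-up (and the need for parallelism) comes from. This is not how koboldcpp implements it, just the idea:

```python
# Toy sketch of speculative decoding with greedy verification. draft_next and
# target_next are hypothetical stand-ins for a small draft model and a large
# target model; each takes a token list and returns the next token.
def speculative_step(tokens, draft_next, target_next, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Target model checks the proposals; keep them while they match what
    #    the target would have produced, then take the target's own token.
    accepted = []
    ctx = list(tokens)
    for t in draft:
        target_t = target_next(ctx)
        if target_t != t:
            accepted.append(target_t)   # first disagreement: use target's token
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # all drafts accepted: one bonus token

    return tokens + accepted

# Example with trivial stand-ins (both just return the sequence length):
out = speculative_step([1, 2, 3], lambda c: len(c), lambda c: len(c))
print(out)  # [1, 2, 3, 3, 4, 5, 6, 7]
```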
 
At the moment, I’m putting some serious thought into getting a more powerful machine. I’m looking ahead to a potential M4 Max Studio when it’s released - with something like 96GB RAM (or even more). I need to be sure I can justify the hit on my finances before I do so, so your experience using an M4 Max MacBook Pro with 128GB RAM is something I’m looking forward to reading about.

Don't worry about it! You can stop anytime you want.
 
None of the models you can run on a laptop are DeepSeek. They are Qwen- and Llama-architecture models distilled by DeepSeek.

If you run the ‘ollama show modelname’ command, you will see the architecture of the model.
 
None of the models you can run on a laptop are DeepSeek. They are Qwen- and Llama-architecture models distilled by DeepSeek.

If you run the ‘ollama show modelname’ command, you will see the architecture of the model.
Not sure who you're replying to; if me, then yes, I'm aware they are distillations of R1.
 