I don't really know the true limit using sysctl iogpu.
Some people say running headless and logging in remotely over ssh will let you give more memory to the GPU.
That's out of date information. It can be adjusted using sysctl iogpu in the Terminal.
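For anyone who wants to try it, here's a minimal sketch. I'm assuming a recent macOS release where the key is iogpu.wired_limit_mb (older versions reportedly used debug.iogpu.wired_limit), and the change does not survive a reboot:

    # show the current GPU wired-memory limit (0 means the default ~75% of RAM)
    sysctl iogpu.wired_limit_mb

    # example: let the GPU wire up to ~56GB on a 64GB machine (resets on reboot)
    sudo sysctl iogpu.wired_limit_mb=57344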
It's out of date because Mac LLM software lets you use more than the default 75%. (Not talking about games in this thread.)

Just because you can adjust it doesn't mean the other info is out of date. If you don't adjust it, the previous information is accurate. In games like Shadow of the Tomb Raider the VRAM of my M1 Max with 32GB RAM is shown as 21GB. throAU said they see 48GB as VRAM out of 64, so Apple's information is still valid unless you have a link to new information.
You can try running the DeepSeek R1 distilled Llama 70B with Q4 quantization. It should run well.

FYI, the Qwen2.5-72B-Instruct model does not run well on an M4 Mac Mini 64GB/1TB.
It won't even load without the loading restrictions lessened.
Not worth the pain, nor the possible system crash/corruption.
Just sayin'...
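Rough back-of-envelope numbers (weights only, ignoring KV cache and runtime overhead): a 72B model at bf16 is roughly 72B x 2 bytes, about 145GB, which is why the full Qwen2.5-72B-Instruct can't load on a 64GB machine at all, while a 70B model at Q4 (around 4.5-5 bits per weight) is roughly 40-45GB and fits once the GPU wired limit is raised.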
Genuine question - what exactly are you running your LLM for? I mean, what's your use case for it?
I find the topic interesting, but struggle to see what I would use an LLM for if I were ever to set one up.
Ahh, I was using the "full on" version of the Qwen2.5-72B-Instruct model, not one of the sub-quantizations.
I find the performance of an LLM degrades quickly with the amount of quantization, despite the same number of parameters. Q2 and Q3 should be avoided IMHO; Q4 minimum. Q6 and above seem not so different from the fp16 or bfloat16 versions.
Good to know, thanks!
It is. And context length is restricted as well; longer context requires tons of RAM. DeepSeek R1 supports 163K tokens if I remember correctly, which would require something like an additional 800GB of RAM, so what we run locally is probably restricted to around 7K. These LLMs require a lot of RAM and compute power.
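For the curious, the RAM cost of context is mostly the KV cache, roughly 2 x layers x KV heads x head dim x bytes per element x tokens (my own back-of-envelope, ignoring activations). For a Llama-3.3-70B-style config (80 layers, 8 KV heads, head dim 128) at fp16 that works out to about 320KB per token: roughly 2.7GB at 8K context, 21GB at 64K, and 43GB at 128K, on top of the weights.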
I just tried DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf on a 64GB M4 Mac Mini Pro; it seemed slow.
I'm interested in learning more about context length.
I believe the Mac Pro will not have an M4 and will probably provide a lot of the RAM required, but I'd expect it to cost between $10,000 and $15,000.

I'm just wondering if the maximum memory capacity of the M4 Max could be doubled, considering AMD achieved 128GB using 4 channels while the M4 Max has 8.
With 256GB, a full-size 671B DeepSeek R1 with 2.51-bit dynamic quantization could fit, with a great accuracy improvement compared to the 70B distilled version.
That would be great.
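Rough math on that (weights only, my own estimate): 671B parameters x 2.51 bits / 8 is about 210GB, so a 256GB machine would have some headroom left for the KV cache and the OS, provided the GPU wired limit is raised.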
I think the M4 Ultra Studio will support more than the 192GB of the M2 generation.

There's a reasonable chance the Studio will get 256GB and a very unlikely chance it'll get 384GB, but I doubt they go higher than either of those on the M4 Ultra. I believe memory will be the gating factor used to upsell people to the Mac Pro / Hidra platform, and it will probably work, assuming they improve matrix operations on the GPU and get the inference speed up a bit.
Yep, see this post by an Apple ML researcher doing exactly that: https://t.co/RETLvpHdVr
256GB is likely, but at what price? Still, two Studios (= 512GB) may be cheaper than whatever big-iron platform is coming down the pike. An LLM can be split a la EXO, some layers on one Studio and the rest on the other, with communication over Thunderbolt 5 networking.
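As a sketch of how that works in practice (assuming exo's zero-config setup as described in its README, not something tested in this thread): install exo on each Studio, then simply launch it on both; nodes on the same network are supposed to discover each other and split the model's layers between them.

    # on each Studio, after installing exo per its README
    exo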
Pricing-wise, there has always been a divide between consumer and workstation/enterprise markets. If Apple doesn't charge too much for the 256GB RAM configuration (e.g. if you can get it with the minimum SSD), buying one in 2025 and adding another as funds become available may be a very good strategy. For those just issuing company purchase orders, by all means go straight for the big yet-to-be-revealed 512GB or 1TB machine.

If you want one computer you're going to pay through the nose, but it is a lot more convenient and less hassle; the workloads aren't too difficult to distribute, though, so it will be a difficult decision for some buyers.
At the moment, I'm putting some serious thought into getting a more powerful machine. I'm looking ahead to a potential M4 Max Studio when it's released, with something like 96GB RAM (or even more). I need to be sure I can justify the hit on my finances before I do so, so your experience using an M4 Max MacBook Pro with 128GB RAM is something I'm looking forward to reading about.
Not sure who you're replying to; if me, yes, I'm aware they are distillations of R1.

None of the models you can run on a laptop are DeepSeek. They are Qwen and Llama architecture models distilled by DeepSeek.
If you run the 'ollama show modelname' command you will see the architecture of the model.
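For example (the model tag is just whatever you have pulled locally; this one is an assumption on my part):

    ollama show deepseek-r1:70b
    # the Model section of the output lists the architecture (llama for this distill),
    # parameter count, context length and quantization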