
OldMike (original poster):

When the Mac Studios were released, I had a tough time deciding which model to get. It mostly came down to a M4 Max 128GB 16/40 or a M3 Ultra 96GB 28/60. Micro Center was having a great deal on the M3 Ultras and I ended up getting one right before the deal ended (I also ended up picking up a base Mac Studio M4 Max to replace an older Mac Mini).

Part of the decision was figuring out which would be better for LLM usage: the base M3 Ultra or the 128GB M4 Max. Since Micro Center's deal brought the prices extremely close, the decision was especially difficult. In a normal year, where the Max and Ultra would be from the same generation, it wouldn't have been as hard. I ended up going with the M3 Ultra because I believe that in most scenarios it will outperform the M4 Max for LLM usage.

I mostly use this machine as a backend for serving LLMs to VS Code via the Continue.dev extension. I also have a Proxmox Linux VM running Open WebUI that points at the M3 Ultra. LM Studio serves the OpenAI-compatible API.
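
In case it saves anyone some wiring, this is the shape of the call that Continue.dev and Open WebUI end up making against LM Studio. A minimal sketch, assuming LM Studio's default local port (1234); the model id is a placeholder, substitute whatever your instance reports:

```python
# Minimal sketch: one completion against LM Studio's OpenAI-compatible server.
# Assumes the default local endpoint (port 1234); the model id is a
# placeholder, substitute the id your instance reports (client.models.list()).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen3-32b",  # placeholder id
    messages=[{"role": "user", "content": "Explain daylight savings time"}],
)
print(response.choices[0].message.content)
```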

I ran some simple performance benchmarks on the M3 Ultra and wanted to post the results in case they help anyone. I used an extremely simple prompt so that I could see the best-case scenario for each model size. More complex prompts could come back with vastly different results, and I wanted to limit that possibility as much as possible to make for a better comparison.
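
For anyone who wants to take similar measurements on their own machine, here is a rough sketch over the same assumed endpoint. Streamed chunks stand in for token counts, so treat the tok/sec as a ballpark rather than an exact figure:

```python
# Rough timing sketch: time-to-first-token and tokens/sec over a streamed
# completion. Streamed chunks stand in for tokens, so tok/sec is approximate.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="qwen3-32b",  # placeholder id
    messages=[{"role": "user", "content": "Explain daylight savings time"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

elapsed = time.perf_counter() - start
if first_token_at is not None and elapsed > first_token_at - start:
    print(f"first token: {first_token_at - start:.2f}s")
    print(f"~{chunks / (elapsed - (first_token_at - start)):.1f} tok/sec")
```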

I hope if someone is on the fence in making a purchase, this information can help in some way.


Mac Studio M3 Ultra 96GB 28/60

MLX Q4, except as otherwise noted
SpecD-1.7B means speculative decoding with the 1.7B model as the draft
Qwen3 models were run with reasoning turned off (tokens/sec are very similar either way)

Prompt: Explain daylight savings time


| Model | Tok/Sec | Total Tokens | First Token | Disk Size |
| --- | --- | --- | --- | --- |
| Deepseek R1 Qwen 7B | 128.58 | 1070 | 0.76s | 4.30 GB |
| Deepseek R1 Qwen 14B | 66.81 | 900 | 0.35s | 8.32 GB |
| Deepseek R1 Qwen 32B | 33.25 | 901 | 0.53s | 18.44 GB |
| Deepseek R1 Llama 70B | 16.45 | 953 | 0.39s | 39.71 GB |
| Phi-4 14B | 71.09 | 442 | 0.11s | 8.26 GB |
| Gemma-3 27B | 33.89 | 1306 | 1.17s | 16.87 GB |
| QWQ 32B | 33.04 | 1021 | 0.54s | 18.45 GB |
| Llama 4 Scout 17B-16E | 44.67 | 572 | 3.75s | 61.14 GB |
| Qwen3 1.7B | 276.20 | 474 | 0.05s | 984 MB |
| Qwen3 4B | 168.68 | 538 | 0.07s | 2.28 GB |
| Qwen3 8B | 114.93 | 360 | 0.08s | 4.62 GB |
| Qwen3 14B | 70.33 | 560 | 0.04s | 8.32 GB |
| Qwen3 14B Q8 | 41.51 | 614 | 0.07s | 15.71 GB |
| Qwen3 32B | 33.88 | 507 | 0.08s | 18.45 GB |
| Qwen3 32B Q8 | 20.11 | 546 | 0.12s | 34.83 GB |
| Qwen3 32B Q8 (SpecD-1.7B) | 22.36 | 493 | 0.13s | 34.83 GB |
| Qwen3 32B BF16 | 10.76 | 660 | 0.21s | 65.54 GB |
| Qwen3 32B BF16 (SpecD-1.7B) | 17.57 | 648 | 0.16s | 65.54 GB |
| Qwen3 30B-A3B | 94.95 | 389 | 0.04s | 17.19 GB |
| Qwen3 30B-A3B Q8 | 75.84 | 598 | 0.05s | 32.46 GB |
| Qwen3 30B-A3B BF16 | 62.62 | 471 | 0.06s | 61.08 GB |
 
I was on the fence when buying my Studio. I went for the M4 Max in the end because of the 128GB RAM.

I’ve recently been playing with Qwen3 235B (which can be squeezed into 88GB). I wouldn’t be doing that with 96GB RAM.

If the base M3 Ultra had been 128GB rather than 96GB, it would have been a whole different story for me.
 
If I had known that Qwen3 would ship a highly performant MoE model that fits in the 32GB RAM difference, it might have pushed me to custom order the M4 Max with 128GB.

The problem is that the full-sized 235B model is over 500GB and is only marginally better than the full-sized 32B, though it would probably run faster. Once 235B gets quantized down, it loses its advantage in the quality of its results. So for me, with what's available today, I'm not sure how much of an advantage 128GB vs 96GB would be compared to having faster inference speed (in most cases) with the Ultra.
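
A quick back-of-the-envelope on sizes, since this is what drove my thinking: weights take roughly params x bits-per-weight / 8 bytes. This is a sketch only; effective bits vary by quant format, and KV cache and runtime overhead come on top:

```python
# Back-of-the-envelope weight sizes: params (billions) * bits / 8 -> GB.
# Effective bits vary by quant format; KV cache/runtime overhead add more.
def weights_gb(params_b: float, bits: float) -> float:
    return params_b * bits / 8

for name, params_b, bits in [
    ("Qwen3 32B BF16", 32, 16.0),  # ~64 GB: fits in 96GB with headroom
    ("Qwen3 235B ~Q4", 235, 4.5),  # ~132 GB: too big even for 128GB
    ("Qwen3 235B ~Q2", 235, 2.5),  # ~73 GB bare; mixed 2-bit quants land nearer 88GB
]:
    print(f"{name}: ~{weights_gb(params_b, bits):.0f} GB")
```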

I'm curious how fast 235B runs on the M4 Max. I'm sure it's quite quick, despite being so large.

I was very surprised at how fast 30B can run on any machine that can fit it into VRAM. For coding, though, the results I'm getting with the non-quantized 30B model are not as good as with 32B at Q4 quantization or above.

The problem I have with the Qwen3 models is how long the reasoning process takes. On a semi-complex coding question, I have seen thinking times of up to 47 minutes across different Qwen3 models. Sometimes during reasoning I've actually had models get stuck in a loop. Switching reasoning off for 32B does not affect the output nearly as much as switching it off on the MoE models.
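
For what it's worth, one way to switch reasoning off: Qwen3's chat template supports soft switches, so prepending /no_think to the user message suppresses the thinking block. How strictly this is honored depends on how the serving stack applies the template, so treat this as a sketch against the same assumed endpoint:

```python
# Sketch: suppressing Qwen3's reasoning via the /no_think soft switch that
# Qwen3's chat template recognizes. Whether the switch is honored depends on
# how the server applies the template, so verify against your serving stack.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen3-32b",  # placeholder id
    messages=[{"role": "user",
               "content": "/no_think Explain daylight savings time"}],
)
print(response.choices[0].message.content)
```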

Despite how fast 30B runs on the Mac Studio, for now I have pretty much settled on running the full-sized Qwen3 32B model (around 64GB) when I want the most accuracy, and 32B Q4 when I want a little more speed. For non-coding usage, I will probably also use 30B because of how speedy it is.
 
I have one of the Qwen/QwQ models (Qwen_QwQ-32B-IQ2_S) on my 24GB MacBook Pro and, for my purposes (brainstorming my story), I’ve been pretty impressed with it. Previously I hadn’t gone lower than 4-bit and any 2-bit model I tried was pointless.

I run Qwen3-235B-A22B-UD-Q2_K_XL on my Studio, but only when I’m not doing much else with it. I’ve used a lot worse models in my time.
 