When the Mac Studios were released, I had a tough time deciding which model to get. It mostly came down to an M4 Max 128GB 16/40 or an M3 Ultra 96GB 28/60. Micro Center was having a great deal on the M3 Ultras, and I ended up getting one right before the deal ended (I also picked up a base Mac Studio M4 Max to replace an older Mac Mini).
Part of the decision came down to which would be better for LLM usage: the base M3 Ultra or the 128GB M4 Max. Since the Micro Center deal made the prices extremely close, the decision was especially difficult. In a normal year, where the Max and Ultra would be of the same generation, it wouldn't have been as hard. I ended up going with the M3 Ultra, because I believe it would outperform the M4 Max for LLM usage in most scenarios.
I mostly use this machine as a backend for serving LLMs to VS Code with the Continue.dev extension. I also have a Proxmox Linux VM running Open WebUI that points at the M3 Ultra. I'm using LM Studio to serve an OpenAI-compatible API.
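In case it helps anyone setting up something similar: both Continue.dev and Open WebUI just point at LM Studio's OpenAI-compatible endpoint. Here's a minimal sketch of hitting it from Python, assuming LM Studio's default port 1234; the model name is hypothetical, so use whatever identifier LM Studio shows for the loaded model.

```python
# Minimal sketch: query LM Studio's OpenAI-compatible server.
# Assumes the default port 1234; swap localhost for the M3 Ultra's
# LAN address when calling from another machine (e.g. the Proxmox VM).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's local server
    api_key="lm-studio",  # LM Studio ignores the key; any string works
)

response = client.chat.completions.create(
    model="qwen3-32b",  # hypothetical identifier; use the name LM Studio lists
    messages=[{"role": "user", "content": "Explain daylight savings time"}],
)
print(response.choices[0].message.content)
```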
I ran some simple performance benchmarks on the M3 Ultra and wanted to post the results in case they help anyone. I used an extremely simple prompt so that I could see the best-case scenario for each model size. More complex prompts could come back with vastly different results, and I wanted to limit that variable as much as possible to make a better comparison.
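The tok/sec and first-token numbers below come from LM Studio's stats readout, but if you want to roughly reproduce them yourself, here's a sketch that times a streaming request against the same endpoint. Counting stream chunks only approximates the token count, and the endpoint and model name are the same assumptions as above.

```python
# Rough tok/sec and time-to-first-token timing against the same
# OpenAI-compatible endpoint. One streamed chunk is roughly one token
# in LM Studio, but this is an approximation, not an exact count.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="qwen3-32b",  # hypothetical name; use the one LM Studio lists
    messages=[{"role": "user", "content": "Explain daylight savings time"}],
    stream=True,
)
for chunk in stream:
    # Guard: some final chunks carry no choices/content.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        n_chunks += 1

elapsed = time.perf_counter() - start
print(f"first token: {first_token_at:.2f}s")
print(f"~{n_chunks / (elapsed - first_token_at):.2f} tok/sec over {n_chunks} chunks")
```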
If you're on the fence about a purchase, I hope this information helps in some way.
Mac Studio M3 Ultra 96GB 28/60
MLX Q4 - except as otherwise noted
SpecD-1.7B is speculative decoding with a 1.7B draft model
Qwen3 models were run with reasoning turned off (tokens/sec are very similar either way)
Prompt: Explain daylight savings time
Model | Tok/Sec | Total Tokens | First Token | Disk Size |
---|---|---|---|---|
Deepseek R1 Qwen 7B | 128.58 | 1070 | 0.76s | 4.30 GB |
Deepseek R1 Qwen 14B | 66.81 | 900 | 0.35s | 8.32 GB |
Deepseek R1 Qwen 32B | 33.25 | 901 | 0.53s | 18.44 GB |
Deepseek R1 Llama 70B | 16.45 | 953 | 0.39s | 39.71 GB |
Phi-4 14B | 71.09 | 442 | 0.11s | 8.26 GB |
Gemma-3 27B | 33.89 | 1306 | 1.17s | 16.87 GB |
QWQ 32B | 33.04 | 1021 | 0.54s | 18.45 GB |
Llama 4 Scout 17B-16E | 44.67 | 572 | 3.75s | 61.14 GB |
Qwen3 1.7B | 276.20 | 474 | 0.05s | 984 MB |
Qwen3 4B | 168.68 | 538 | 0.07s | 2.28 GB |
Qwen3 8B | 114.93 | 360 | 0.08s | 4.62 GB |
Qwen3 14B | 70.33 | 560 | 0.04s | 8.32 GB |
Qwen3 14B Q8 | 41.51 | 614 | 0.07s | 15.71 GB |
Qwen3 32B | 33.88 | 507 | 0.08s | 18.45 GB |
Qwen3 32B Q8 | 20.11 | 546 | 0.12s | 34.83 GB |
Qwen3 32B Q8 (SpecD-1.7B) | 22.36 | 493 | 0.13s | 34.83 GB |
Qwen3 32B BF16 | 10.76 | 660 | 0.21s | 65.54 GB |
Qwen3 32B BF16 (SpecD-1.7B) | 17.57 | 648 | 0.16s | 65.54 GB |
Qwen3 30B-A3B | 94.95 | 389 | 0.04s | 17.19 GB |
Qwen3 30B-A3B Q8 | 75.84 | 598 | 0.05s | 32.46 GB |
Qwen3 30B-A3B BF16 | 62.62 | 471 | 0.06s | 61.08 GB |
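On the SpecD rows: speculative decoding pairs the big model with the 1.7B model as a draft, and as the table shows, it helps most where the big model is slowest (BF16 went from 10.76 to 17.57 tok/sec, while Q8 only gained about 2). I used LM Studio's built-in speculative decoding option, but if you want to try the same idea directly in MLX, recent mlx-lm versions support a draft model during generation. This is just a sketch under assumptions: that your mlx-lm version forwards a draft_model argument through generate(), and that these mlx-community repo names exist.

```python
# Sketch of speculative decoding in MLX directly (the table above used
# LM Studio's built-in option). Assumes a recent mlx-lm where generate()
# forwards draft_model to the speculative decoding path, and that these
# mlx-community model repos exist; both are assumptions, check your version.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-32B-8bit")  # main model
draft_model, _ = load("mlx-community/Qwen3-1.7B-4bit")   # small draft model

text = generate(
    model,
    tokenizer,
    prompt="Explain daylight savings time",
    draft_model=draft_model,  # drafts tokens for the 32B model to verify
    verbose=True,             # prints tok/sec stats similar to the table
)
```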