When the Mac Studios were released, I had a tough time deciding which model to get. It mostly came down to an M4 Max 128GB 16/40 or an M3 Ultra 96GB 28/60. Micro Center was having a great deal on the M3 Ultras and I ended up getting one right before the deal ended (I also picked up a base Mac Studio M4 Max to replace an older Mac mini).

Part of the decision was which would be better for LLM usage - the base M3 Ultra or the 128GB M4 Max. Since the prices were extremely close thanks to the Micro Center deal, the choice was especially difficult. In a normal year, where the Max and Ultra would be of the same generation, it wouldn't have been as hard. I ended up going with the M3 Ultra because I believe that in most scenarios it would outperform the M4 Max for LLM usage.

I mostly use this as a backend for serving LLMs to VS Code with the Continue.dev extension. I also have a Proxmox Linux VM running Open WebUI pointed at the M3 Ultra. LM Studio serves the OpenAI-compatible API.
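
For anyone setting up something similar: once LM Studio's server is running on the Studio, anything that speaks the OpenAI API can point at it. A minimal sketch is below - the LAN address, port (1234 is LM Studio's default) and model identifier are placeholders for whatever your setup actually uses:

Code:
# Minimal sketch: talking to LM Studio's OpenAI-compatible server from another
# machine. The address, port and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:1234/v1",  # Mac Studio's address on the LAN (assumed)
    api_key="lm-studio",                     # LM Studio doesn't check the key, but one is required
)

resp = client.chat.completions.create(
    model="qwen3-32b",  # whichever identifier LM Studio shows for the loaded model
    messages=[{"role": "user", "content": "Explain daylight savings time"}],
)
print(resp.choices[0].message.content)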

I ran some simple performance benchmarks on the M3 Ultra and just wanted to post the results in case they help anyone. I used an extremely simple prompt so that I could see the best-case scenario for each model size. More complex prompts could come back with vastly different results, and I wanted to limit that possibility as much as possible so that I could make a better comparison.
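
The tok/sec, total tokens and time-to-first-token figures below come straight from LM Studio, but something similar can be approximated over the API with a streaming request. A rough sketch (endpoint and model name are placeholders, and counting stream chunks only approximates the token count):

Code:
# Rough sketch: timing a streamed response. Counting stream chunks only
# approximates the token count; LM Studio's own stats are more precise.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_time = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen3-32b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain daylight savings time"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1
end = time.perf_counter()

if first_token_time is not None:
    gen_time = max(end - first_token_time, 1e-6)
    print(f"First token: {first_token_time - start:.2f}s")
    print(f"~{chunks} tokens, ~{chunks / gen_time:.1f} tok/sec")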

I hope if someone is on the fence in making a purchase, this information can help in some way.


Mac Studio M3 Ultra 96GB 28/60

MLX Q4 - except as otherwise noted
SpecD-1.7B is speculative decoding with 1.7B model
Qwen3 models were run with reasoning turned off (tokens/sec are very similar either way)

Prompt: Explain daylight savings time


Model                        | Tok/Sec | Total Tokens | First Token | Disk Size
-----------------------------|---------|--------------|-------------|----------
Deepseek R1 Qwen 7B          | 128.58  | 1070         | 0.76s       | 4.30 GB
Deepseek R1 Qwen 14B         | 66.81   | 900          | 0.35s       | 8.32 GB
Deepseek R1 Qwen 32B         | 33.25   | 901          | 0.53s       | 18.44 GB
Deepseek R1 Llama 70B        | 16.45   | 953          | 0.39s       | 39.71 GB
Phi-4 14B                    | 71.09   | 442          | 0.11s       | 8.26 GB
Gemma-3 27B                  | 33.89   | 1306         | 1.17s       | 16.87 GB
QWQ 32B                      | 33.04   | 1021         | 0.54s       | 18.45 GB
Llama 4 Scout 17B-16E        | 44.67   | 572          | 3.75s       | 61.14 GB
Qwen3 1.7B                   | 276.20  | 474          | 0.05s       | 984 MB
Qwen3 4B                     | 168.68  | 538          | 0.07s       | 2.28 GB
Qwen3 8B                     | 114.93  | 360          | 0.08s       | 4.62 GB
Qwen3 14B                    | 70.33   | 560          | 0.04s       | 8.32 GB
Qwen3 14B Q8                 | 41.51   | 614          | 0.07s       | 15.71 GB
Qwen3 32B                    | 33.88   | 507          | 0.08s       | 18.45 GB
Qwen3 32B Q8                 | 20.11   | 546          | 0.12s       | 34.83 GB
Qwen3 32B Q8 (SpecD-1.7B)    | 22.36   | 493          | 0.13s       | 34.83 GB
Qwen3 32B BF16               | 10.76   | 660          | 0.21s       | 65.54 GB
Qwen3 32B BF16 (SpecD-1.7B)  | 17.57   | 648          | 0.16s       | 65.54 GB
Qwen3 30B-A3B                | 94.95   | 389          | 0.04s       | 17.19 GB
Qwen3 30B-A3B Q8             | 75.84   | 598          | 0.05s       | 32.46 GB
Qwen3 30B-A3B BF16           | 62.62   | 471          | 0.06s       | 61.08 GB
 
I was on the fence when buying my Studio. I went for the M4 Max in the end because of the 128GB RAM.

I’ve recently been playing with qwen 3 235b (which can be squeezed into 88GB). I wouldn’t be doing that with 96GB RAM.

If the base M3 Ultra had been 128GB rather than 96GB, it would have been a whole different story for me.
 
Had I known that Qwen3 would have such a highly performant MoE model that would fit in the 32GB RAM difference, it might have pushed me to custom order the M4 Max with 128GB.

The problem is that the full-sized 235B model is over 500GB and is only marginally better than the full-sized 32B, though it will probably run faster. Once 235B gets quantized down, it loses its advantage in the results it provides. So for me, with what's available today, I'm not sure how much of an advantage 128GB vs 96GB would be compared to having faster inference speed (in most cases) with the Ultra.

I'm curious how fast 235B runs on the M4 Max, I'm sure it is quite quick, despite being so large.

I was very surprised at how fast the 30B model runs on any machine that can fit it into VRAM. For coding, the results I am getting with the non-quantized 30B model are not as good as 32B at Q4 quantization or above.

The problem I have with the Qwen3 models is how long the reasoning process can be. On a semi-complex coding question, I have had thinking times of up to 47 minutes on different Qwen3 models. Sometimes during reasoning, I've actually had some of the models get stuck in a loop. Switching reasoning off for 32B does not affect the output nearly as much as switching reasoning off on the MoE models.

Despite how fast the 30B runs on the Mac Studio, for now I have pretty much settled on running the full-sized Qwen3-32B model (around 64GB) when I want the most accuracy, and I will run 32B Q4 if I want a little more speed. For non-coding usage, I will probably also use 30B because of how speedy it is.
 
I have one of the Qwen/QwQ models (Qwen_QwQ-32B-IQ2_S) on my 24GB MacBook Pro and, for my purposes (brainstorming my story), I’ve been pretty impressed with it. Previously I hadn’t gone lower than 4-bit and any 2-bit model I tried was pointless.

I run Qwen3-235B-A22B-UD-Q2_K_XL on my Studio, but only when I’m not doing much else with it. I’ve used a lot worse models in my time.
 
I run Qwen3-235B-A22B-UD-Q2_K_XL on my Studio, but only when I’m not doing much else with it. I’ve used a lot worse models in my time.


Have you seen these writing benchmarks before? Qwen3-235B does extremely well on these - very impressive. Qwen3-32B, in comparison, does not have impressive scores for writing - even the previous QWQ-32B outperforms it on these tests. The benchmarks provided in Qwen3's release blog didn't show such a gap between the two in any of the general tests they ran. I guess that proves, once again, that it is always important to compare these models against your exact usage domain.

I don't use these for writing so I would have never guessed the differences would be so large compared to other benchmarks.
 

That’s very insightful, thank-you. I’ll bookmark that for later.

I'm curious how fast 235B runs on the M4 Max, I'm sure it is quite quick, despite being so large.

Here are the results of a (very unscientific) test:

One primary use of local LLMs for my purposes is to summarise scenes I’ve written to better help me glance through my files without needing to read the entire thing.

The following tests were done on LM Studio.

The prompt I’ve used is: “Summarise this scene in 400-500 words, focusing on <character>’s emotions, describing exactly what <character> sees, and analysing <other character’s> character and attitude. Then list the most prominent key moments from the scene.” I also have a system prompt giving some information of the story’s overall theme, context, environment, and characters (about 512 tokens).

Model: Qwen3 235B A22B UD (qwen3-235b-a22b)
Quant: Q2_K_XL
Context length: 20,296 tok (max for this model is 40,960)

Scene 1 (1,708 words): Thinking 26.24sec • 13.04 tok/sec • 1201 tokens • 37.52s to first token
Scene 2 (1,545 words): Thinking 44.55sec • 10.78 tok/sec • 1330 tokens • 60.33s to first token
Scene 3 (3,404 words): Thinking 50.75sec • 8.22 tok/sec • 1294 tokens • 135.81s to first token

Memory Used: ~103.6GB
Cached Files: ~14.39GB

Observationally, scene 1’s response was faster than my comfortable reading speed, scene 2’s was slightly faster, and scene 3’s was a touch slower than my reading speed. I expect it would be faster on the M3 Ultra (given sufficient RAM) due to the higher memory bandwidth.

Prompt processing time on scene 3 was much more noticeable than on the others (probably due to the scene being twice as long).

Quality: There’s some minor hallucination (the LLM uses elements and words not actually in the scene but that could have been derived from the scene - they do “fit” in with the tone and style of the scene, it’s not “way out bonkers”). Maybe if I’m more specific with my prompt & system prompt, I could tighten that up a bit.

I’d say it’s generally in the region of 90% usable. I’m pretty happy with it. Due to the memory used, I can’t really do anything else of significance while using this LLM (I have Obsidian open, plus my browser, Notes, Terminal, Activity Monitor, etc, but nothing stressful).
 

Thank you for all of that great information. I've been playing with all of the Qwen3 models except for that one since I can't fit it into 96GB, so I've really been wondering how it ran on the M4 Max.

I'm pretty sure that I have seen you write about which version you run in a different post (though it might have been someone else) - but I will still ask, is that the MLX version? I ask because I haven't noticed a Q2 MLX version yet. If it is the GGUF version, I would imagine an MLX Q2 version, if one existed, would run quite a bit faster. The other benefit of running MLX models is that you don't have to pre-allocate VRAM for the context size. Because the VRAM is allocated dynamically, even with a model loaded you still have use of the RAM for the rest of your system most of the time.

I'm not sure if you have seen these recommendations from the Qwen3 team on Hugging Face, but these are the adjustments they recommend when running their models (which reminds me - if you had reasoning mode enabled (/think vs /no_think), do you remember how long those three scenes were thinking before the response began?):

For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.

For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0. For more detailed guidance, please refer to the Best Practices section.


Just in case you are not sure (I might not be telling you anything new, but maybe it will help someone else), these settings can be changed by going to 'My Models' in the left menu bar, clicking the settings icon on the right for that particular model, and then going to the Inference tab. I have tested both with and without these settings and I am still not sure how important they are - but they are recommended by the people who made the models. In thinking mode they might have produced fewer hallucinations. There were a few times, without these settings, that I got into an infinite loop during thinking mode while having it analyze a piece of code.
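
If you ever end up hitting the server over the API instead of the GUI, the same settings can be passed per request. A rough sketch below - temperature and top_p are standard OpenAI parameters, while top_k and min_p aren't, so they go through the client's extra_body and will only take effect if the server honours them (the endpoint and model name here are placeholders):

Code:
# Sketch: passing the Qwen3 non-thinking-mode sampling settings per request.
# top_k / min_p are not standard OpenAI parameters - they are forwarded via
# extra_body and only apply if the backend supports them.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-32b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain daylight savings time /no_think"}],
    temperature=0.7,    # non-thinking-mode values recommended by the Qwen3 team
    top_p=0.8,
    extra_body={"top_k": 20, "min_p": 0},
)
print(resp.choices[0].message.content)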

Thanks again for a great post!
 
I couldn’t find an MLX version that would fit in my RAM. The only one I found in LM Studio (I didn’t do an exhaustive search on Hugging Face) is a 4-bit version that wants 132GB RAM, so that’s a non-starter for me. Although, now that I look, I see there is a Qwen3-235B-A22B-3bit at 103GB (the non-MLX version is 111GB). I very much doubt that will work, but I’ll try to get it downloaded and see if I can do something - maybe with a small context length - without it crashing. For interest, the “next size up” - 8-bit - is 249GB, which wouldn’t even squeeze into a 256GB M3 Ultra Mac Studio.

The only work I’ve done with this model so far has included thinking - because it hadn’t occurred to me to have it “not thinking”. I’ve used many non-thinking models before, so I don’t intend to turn it off for this one.

Thank-you for those recommended settings for this model. I’ll use those when I next get it to work on my scenes and see what difference it makes. When I was working on my own llama.cpp app, I did play about with those settings and they did seem to make a difference - mostly on the more wayward (smaller) models - but it’s still a bit of a mystery what they actually mean or do.

I did list the thinking time for the three scenes in my previous post: 26.24sec / 44.55sec / 50.75sec
 
I have one of the Qwen/QwQ models (Qwen_QwQ-32B-IQ2_S) on my 24GB MacBook Pro and, for my purposes (brainstorming my story), I’ve been pretty impressed with it. Previously I hadn’t gone lower than 4-bit and any 2-bit model I tried was pointless.

I run Qwen3-235B-A22B-UD-Q2_K_XL on my Studio, but only when I’m not doing much else with it. I’ve used a lot worse models in my time.
Have you tried lower parameter models with higher quants for your use case? I find anything under Q4 inaccurate in any parameter size.
 
Have you tried lower parameter models with higher quants for your use case? I find anything under Q4 inaccurate in any parameter size.
On my Mac Studio, I’m mostly just pushing its limits. I’m only using it on days when I’m not at work, so I’ve been doing the odd hour or two on my MacBook Pro between shifts.

On my MacBook Pro (24GB), I’m currently switching between qwen_qwq-32b (qwen2) and Qwen3 8B. There is some difference between the two.

When given the prompt “Create a summary of this scene in approximately 500-600 words. Include all key emotions and significant highlights.”, Qwen3 8B’s thinking time was 18 seconds and its response came at 22.43 tok/sec (1269 tokens, 11.58s to first token), while qwen_qwq-32b’s thinking time was 66 seconds and its response came at 8 tok/sec (1605 tokens, 47.72s to first token).

On comparing the output, qwen_qwq-32b did the better job. It picked out more detail from the scene. It created a one-paragraph summary followed by five clearly headed paragraphs (such as “Undercurrents and Foreshadowing”, “Jealousy and Possession”), followed by a list of four key highlights. It finishes with a short overview of the scene’s “emotional core”.

Qwen3 8B provides a six-paragraph summary with no headings and no list of key highlights (so it ignored that part of the prompt). Each paragraph is kind of like the ones qwen_qwq-32b offered, just without the headers, but it doesn’t feel quite as “to the point” as the other one.

Don’t get me wrong, I’m not unhappy about either’s response - it feels like trying to split hairs - and I’m not really comparing apples to apples (the 32B model is based on Qwen2), but it feels like the 32B parameter model actually read the scene, understood the prompt, and structured its output better. On my MacBook, I would stick with qwen_qwq-32b (despite it being IQ2) over the Qwen3 8B model (despite that being Q4).

I think for creative tasks (such as my story writing), a larger parameter count is more useful than the quant level.

For other tasks, as you’ve suggested, fewer parameters at Q4 or better might be the best option.
 
I’ve been doing a little test to see if the MLX version of Qwen3-235B-A22B-3bit will work on my 128GB M4 Max Mac Studio.

The model size is about 104GB, so the likes of LM Studio didn’t like it. I put together a little Python script to load the model and send a prompt (“Explain daylight savings time to a 5-year-old.”).
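
For anyone wanting to try the same thing, the standard mlx-lm load-and-generate pattern is all that’s needed - a rough sketch below (the repo id is an assumption; substitute whichever 3-bit MLX quant you actually downloaded):

Code:
# Sketch of the standard mlx-lm load-and-generate pattern (pip install mlx-lm).
# The repo id below is an assumption - use whichever 3-bit MLX quant you have.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-235B-A22B-3bit")

messages = [{"role": "user", "content": "Explain daylight savings time to a 5-year-old."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints the prompt/generation tokens-per-second and peak memory
# figures shown in the output below.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)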

At first the system balked with:

Output:
==========

[WARNING] Generating with a model that requires 98089 MB which is close to the maximum recommended size of 98304 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-lm/tree/main#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory

To get around that, I bumped the wired limit to above the size of the LLM model with this:

Code:
sudo sysctl iogpu.wired_limit_mb=107520
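# 107520 MB = 105 x 1024 MB (~105 GB), comfortably above the ~98 GB the model reports needing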

This then allowed the model to run, re:

Output:
==========

[WARNING] Generating with a model that requires 98089 MB which is close to the maximum recommended size of 107520 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-lm/tree/main#large-models

<think>
Okay, so the user wants to explain daylight savings time to a 5-year-old. Let me start by understanding what daylight savings time is. From what I know, it's when we move the clock forward in the spring to make better use of daylight in the evenings. But how do I explain that to a child?

First, I need to simplify the concept. Maybe use an analogy they understand, like playing outside. When the clock is moved forward, we get more sunlight in the evening, so we can play longer. But then, there's the time change. How do I explain the clock moving without confusing them?

Maybe use a story. Imagine the sun is a friend who comes to play. In the summer, the sun stays up longer, so we move the clock so we can play together in the evening. In the winter, the sun goes to bed earlier, so we move the clock back to have breakfast time.

Wait, but the user said to explain the time change, not the seasons. Maybe use the clock as a character. When it's spring, the clock moves its bedtime to a later hour so we can play more in the sun. Then, in the fall, the clock moves bedtime back to the earlier hour.

I should also
==========

Prompt: 20 tokens, 28.958 tokens-per-sec
Generation: 256 tokens, 33.049 tokens-per-sec
Peak memory: 102.959 GB

With the /nothink switch, I get this:

Output:
==========
[WARNING] Generating with a model that requires 98089 MB which is close to the maximum recommended size of 107520 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-lm/tree/main#large-models

<think>

</think>
Sure! Here's a simple explanation of daylight savings time for a 5-year-old:
---
🌞 "You know how we play outside in the light every day? In the summer, the sun stays up later, so we can play longer in the evening! To help grown-ups and kids like you enjoy the light in the evenings, we move the clock a little earlier in the spring. Then, when it gets colder and the sun doesn't stay up as long, we move the clock back again in the fall."

👉 It's like giving everyone a little more sunshine to play in the summer, and making mornings brighter in the fall!
---
Would you like a fun story or drawing to go with it? 😊
==========

Prompt: 23 tokens, 59.934 tokens-per-sec
Generation: 149 tokens, 33.045 tokens-per-sec
Peak memory: 102.939 GB

Activity Monitor showed Memory Used hitting about 105GB (but Memory Pressure stayed in the green).

I’m not sure I gain a massive amount using this 3-bit model over running the Q2_K_XL model more conveniently in LM Studio (which leaves more RAM for context and output). The exercise feels like climbing a mountain “because it’s there”, not necessarily because it’ll achieve anything.

However, just knowing that it works on this machine is something, and I may spend more time on it in the future. If I ever get around to finishing my own lightweight/minimalist GUI on top of llama.cpp, I may find it more useful for my story-writing purposes, but for the time being I’ll leave it as a curiosity piece.
 
As an addition, I went back to LM Studio (which I hadn’t tested after sudo sysctl iogpu.wired_limit_mb=107520) and ran my previous prompt ("Explain daylight savings time to me in a simple way. /nothink”), and it worked, re:


Sure! Here's a simple explanation of **Daylight Savings Time**:

### What is Dayightlight Savings Time?
Daylight Savings Time (DST) is when we **“spring forward” our clocks by one hour** in the spring, to make evenings brighter and mornings darker.

We do this because it helps us make better use of daylight during the longer summer days.

---

### The Main Idea:
- In **spring**, we “**move the clock forward**” by one hour. → 🕰️
- In **fall**, we “**move the clock back**”. ✔

---

### Example:
If in the winter it gets dark at 6 PM, then during DST:
- The **sun stays up a little later**, maybe until 7 PM.

This means you get more daylight in the evening to enjoy outdoor activities or walk home from work safely.

---

### Mnemonic:
- **“Spring ahead, Fall back.”**

That’s daylight savings time in a nutshell! 😊
Let me know if you want to learn more!

30.89 tok/sec • 220 tokens • 4.43s to first token • Stop reason: EOS Token Found

Activity Monitor is showing Memory Used: 109.60GB (this is with LM Studio running, my browser, terminal, etc, but nothing much else), with LM Studio Helper using 96.39GB.
 
I’ve just asked the 3-bit model through LM Studio to summarise one of my scenes and provide a list of key moments. The scene is 3,000+ words and my system prompt is just over 1000 tokens.

Thinking time was 13.33 seconds.
26.20 tok/sec • 887 tokens • 18.17s to first token • Stop reason: EOS Token Found

I increased the context length to 16298 for this test, and Memory Used is currently at 110GB, but I now also have Obsidian open (memory pressure is still in the green).

This is actually *much* quicker and more responsive than the non-MLX Q2_K_XL version I had been using before, so I think I’m going to stick with it for now. When I need to work at a much higher context length (or when I’m running other stuff alongside it), I may end up using something else (the max context length here is 40,960 apparently - but I’ve not yet gone that high with the 3-bit version).
 
I had to discontinue using MLX quants due to their excessive hallucinations. While they were faster, their inaccuracy rendered them unsuitable for my needs.

I primarily utilize large language models for quantitative analysis, which is distinctly different from creative writing!
 
Yes, there are some minor hallucinations in the summaries I ask the LLMs to do. It’s nothing I can’t deal with and, for my use case, creativity is more important. They generally relate to moments or dialogue being attributed to the wrong character.

I suspect if I tweak the settings and prompt, I may be able to reduce the hallucinations.

I think one cause of the issue is that my prompt is quite long (usually a scene of a few thousand words, a character file of a few thousand words, and a system prompt of around a thousand tokens). But I’ll work on it.
 
For interest, I asked the LLM to read, summarise, and analyse a 7-act scene, paying specific attention to character dynamics and making sure to attribute dialogue to the correct character. It’s about 5,500 words, and I also included a “character file” for background (another 3,000 words).

I ran the same query through the MLX 235B Q3 Qwen3 and then again through the non-MLX 32B Q8 Qwen3.

In terms of the “minor hallucinations” I mentioned before, they’re present in the output from both - so the quant didn’t make much difference on that. It would be difficult to prefer one over the other from this test.

Response comparisons (although this test was about looking for hallucinations rather than speed):

MLX 235B Q3: 56s thinking • 19.43 tok/sec • 2973 tokens • 87.57s to first token
32B Q8: 2m 19s thinking • 4.96 tok/sec • 2849 tokens • 94.41s to first token
 
Prompt crafting is becoming something of a dark art with the latest models, as what proves effective for one may be suboptimal for another.
 