Has anyone tried the MLX format of LLMs (such as this 4-bit version of Deepseek-R1 Distill Llama 70B: https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit) It is said to bring much faster performance than the GGUF format, and its size seems much smaller than the GGUF version too. Is there a difference in accuracy?

No difference in accuracy. Language models are non-deterministic so it’s always hard to predict how they will respond.

You can compare the MLX and GGUF versions side by side in LM Studio.

If there is a difference in speed, it's something like half a token per second. GGUF is already handled about as well as it can be; converting to another format isn't going to make a noticeable difference.
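If you want to measure it yourself outside of LM Studio, a rough sketch with the mlx-lm Python package looks something like this (a sketch only, assuming you have the unified memory for the ~40GB of 4-bit 70B weights; verbose=True prints tokens/sec, which you can compare against the same prompt run on the GGUF build):

# Rough timing sketch with mlx-lm (pip install mlx-lm).
# Assumes enough unified memory for the 4-bit 70B weights (~40GB).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit")

prompt = "Explain the difference between the GGUF and MLX model formats in two sentences."

# verbose=True prints the generation speed (tokens/sec), so you can compare
# it against the same prompt run on the GGUF version in LM Studio or llama.cpp.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)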
 
But it must be said that the GGUF version is much larger than the MLX version (such as this Q4_K_M version which is ~42GB https://huggingface.co/bartowski/De...ain/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf), so surely with the decrease in size comes an increase in speed?
 

4-bit GGUF and MLX are both circa 40GB from the common sources such as unsloth, bartowski, and mlx-community. You can search them all neatly in one place in LM Studio.

4-bit quants are just not worth it for any model except to play around with or to benchmark your hardware. They are too unpredictable and very bad at coding tasks.
 
Thanks! I thought the 4 bit MLX and the Q4_K_M GGUF that I sent were roughly the same thing. How about this 8-bit version (https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Llama-70B-8bit)? It is still small in size compared to the Q4 GGUF models. Sorry for repeatedly asking instead of trying myself, but I haven't fully convinced myself to upgrade my current M1 Pro to the M4 Max yet, so I don't have the hardware to test this out currently.
 

You'll get about 10-15 tok/s either GGUF or MLX. You'll need 128GB of memory, since there is no longer a 96GB option.
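Rough back-of-the-envelope math on the sizes, if it helps (a sketch only; it ignores the KV cache, context length and runtime overhead, and real quant files carry extra scales and metadata, so they come out a bit larger):

# Weight size ~= parameter count * bits per weight / 8 bytes.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of bytes ~= GB

for label, bits in [("4-bit", 4.0), ("Q4_K_M (~4.8 bpw)", 4.8), ("8-bit", 8.0)]:
    print(f"70B @ {label}: ~{weight_gb(70, bits):.0f} GB")

# Prints ~35 GB, ~42 GB and ~70 GB. So the 8-bit 70B is actually bigger than
# the Q4 GGUF, and with the KV cache and macOS on top it won't fit in 64GB,
# which is why 128GB is the safe choice.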
 
Thanks! Would you say an M4 Max with 64GB RAM running 32b models would be the sweet spot for speed and accuracy, if you're aiming for 20+ tok/s?
 
Genuine question - what exactly are you running your LLM for? I mean, what's your use case for it?
I find the topic interesting, but struggle to see what I would use an LLM for if I were to ever set one up...

Just last week, I had a quote with a bunch of columns in it for some renewals and I asked the LLM to do this with the PDF file dropped into it:

attached is a renewal quote for a bunch of Cisco equipment. in the "description" column, the device serial numbers are listed, they start with the pattern FGL.

please create a table consisting of the following values: product code, Serial number, description, total

The serial numbers were not the only thing in the description column (it had other info as well), and the quote contained much more than just the table: disclaimer, terms, company logo, contact details, etc.

It nailed it. It even picked up serial numbers I didn't tell it about because they didn't start with FGL - it figured that out on its own:

(screenshot of the generated table)



Then:
please create a csv file of this


Also nailed it - gave me some stuff to copy/paste into a text editor.
This CSV file contains all the details from the table you requested, including:

  • Product code: The unique identifier for each product.
  • Serial number: The serial numbers of the Cisco equipment (starting with "FGL").
  • Description: A brief description of the product and its purpose.
  • Total: The total cost for each item.
You can copy this CSV data into spreadsheet software like Microsoft Excel, Google Sheets, or any other application that supports CSV files to view or edit the data further.


Could I have done all this manually? Sure, but the local LLM did it in a matter of seconds (12 seconds from PDF to table - model: deepseek-r1-distill-llama-8b). It would have taken me an hour or so and I would have made mistakes no doubt.
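If you wanted to script the same thing instead of dropping the PDF into the chat window, something like this rough sketch against LM Studio's local OpenAI-compatible server should get you close (assumes the pypdf and openai packages, the default localhost:1234 port, a placeholder file name, and whatever model id you have loaded):

# Sketch: extract the quote PDF's text and ask the local model for a CSV.
# Assumes LM Studio's server is running at the default http://localhost:1234/v1.
from openai import OpenAI
from pypdf import PdfReader

reader = PdfReader("renewal_quote.pdf")  # placeholder file name
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = (
    "Attached is a renewal quote for Cisco equipment. In the description "
    "column the device serial numbers are listed (they start with FGL). "
    "Create a CSV with these columns: product code, serial number, "
    "description, total.\n\n" + pdf_text
)

resp = client.chat.completions.create(
    model="deepseek-r1-distill-llama-8b",  # whatever model is loaded
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)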
 
OK, I'm sold!
 
oh also, that was a smaller deepseek r1 distillation of llama, running in LM Studio

it's just what I happened to be playing with at the time. I'm sure plenty of other models could have handled that.
 
Thanks! Would you say an M4 Max with 64GB RAM running 32b models would be the sweet spot for speed and accuracy, if you're aiming for 20+ tok/s?

If you're a casual user, maybe, but if you're serious and want something closer to ChatGPT-level performance on a local system, 32b is going to look like nothing in a year. Whenever we get more GPU power and more memory, the newest models fill it up. It's just like any app development: you give developers better hardware and the software becomes bloatier, heavier and more compute-intensive, sometimes for no good reason.

Some people optimistically believe models will become smaller and more efficient, but there is a limit to that. Meanwhile, models overall will keep growing in complexity to become more reliable, and they will always lag behind knowledge and information in the real world.
 

Yeah, I'm a bit in both camps.

I have optimism for local models, but I'm also a ChatGPT Plus subscriber, and what's happening over there is pretty damn impressive. Essentially, if I'm processing any of my own private data (like the example above), I do it locally; for general questions, etc., I use ChatGPT as appropriate.
 

For coding that is the most efficient way to do it: ask generic questions to GitHub Copilot (which is backed by ChatGPT and Sonnet) and keep private project details to the local model.

Can’t keep Qwen Coder in VRAM all the time because my projects and apps need the memory.
 
Just last week, I had a quote with a bunch of columns in it for some renewals and I asked the LLM to do this with the PDF file dropped into it:
How were you able to attach files for the LLM to work with?
I am using LM Studio, BTW.

Thanks!
 
FYI...

My M4 Pro Mac mini, 64GB, shows the following for the last session:

Assistant
qwen2.5-coder-32b-instruct@q8_0

5.65 tok/sec
828 tokens
1.67s to first token
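If anyone wants to reproduce those stats outside the LM Studio UI, a quick streaming timer against the local server gives roughly the same numbers (a sketch only; the model id and prompt are just examples, and the chunk count only approximates the token count):

# Sketch: time-to-first-token and tokens/sec from LM Studio's local server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen2.5-coder-32b-instruct",  # whatever model is loaded
    messages=[{"role": "user", "content": "Write a short bubble sort in Python."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.time()
        chunks += 1

print(f"time to first token: {first_token_at - start:.2f}s")
print(f"~{chunks / (time.time() - first_token_at):.2f} tok/sec over {chunks} chunks")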
 