Has anyone tried the MLX format of LLMs (such as this 4-bit version of Deepseek-R1 Distill Llama 70B: https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit) It is said to bring much faster performance than the GGUF format, and its size seems much smaller than the GGUF version too. Is there a difference in accuracy?

No difference in accuracy. Language models are non-deterministic so it’s always hard to predict how they will respond.

You can compare the MLX and GGUF versions side by side in LM Studio.

If there is a difference in speed, it's something like half a token per second. GGUF is already handled about as well as it can be; converting to another format isn't going to make a noticeable difference.
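If you want to measure it yourself outside of LM Studio, a rough sketch with the mlx-lm Python package looks something like this (a sketch only, assuming you have the unified memory for the ~40GB of 4-bit 70B weights; verbose=True prints tokens/sec, which you can compare against the same prompt run on the GGUF build):

# Rough timing sketch with mlx-lm (pip install mlx-lm).
# Assumes enough unified memory for the 4-bit 70B weights (~40GB).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit")

prompt = "Explain the difference between the GGUF and MLX model formats in two sentences."

# verbose=True prints the generation speed (tokens/sec), so you can compare
# it against the same prompt run on the GGUF version in LM Studio or llama.cpp.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)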
 
But it must be said that the GGUF version is much larger than the MLX version (such as this Q4_K_M version which is ~42GB https://huggingface.co/bartowski/De...ain/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf), so surely with the decrease in size comes an increase in speed?
 

4-bit GGUF and MLX are both circa 40GB from the common sources such as unsloth, bartowski, and mlx-community. You can search them all neatly in one place in LM Studio.

4-bit quants are just not worth it for any model except to play around with or to benchmark your hardware. They are too unpredictable and very bad at coding tasks.
 
Thanks! I thought the 4 bit MLX and the Q4_K_M GGUF that I sent were roughly the same thing. How about this 8-bit version (https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Llama-70B-8bit)? It is still small in size compared to the Q4 GGUF models. Sorry for repeatedly asking instead of trying myself, but I haven't fully convinced myself to upgrade my current M1 Pro to the M4 Max yet, so I don't have the hardware to test this out currently.
 

You'll get about 10-15 tok/s either GGUF or MLX. You'll need 128GB of memory, since there is no longer a 96GB option.
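Rough back-of-the-envelope math on the sizes, if it helps (a sketch only; it ignores the KV cache, context length and runtime overhead, and real quant files carry extra scales and metadata, so they come out a bit larger):

# Weight size ~= parameter count * bits per weight / 8 bytes.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of bytes ~= GB

for label, bits in [("4-bit", 4.0), ("Q4_K_M (~4.8 bpw)", 4.8), ("8-bit", 8.0)]:
    print(f"70B @ {label}: ~{weight_gb(70, bits):.0f} GB")

# Prints ~35 GB, ~42 GB and ~70 GB. So the 8-bit 70B is actually bigger than
# the Q4 GGUF, and with the KV cache and macOS on top it won't fit in 64GB,
# which is why 128GB is the safe choice.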
 
Thanks! Would you say an M4 Max with 64GB RAM running 32b models would be the sweet spot for speed and accuracy, if you're aiming for 20+ tok/s?
 
Genuine question - what exactly are you running your LLM for? I mean, what's your use case for it?
I find the topic interesting, but struggle to see what I would use an LLM for if I were to ever set one up...

Just last week, I had a quote with a bunch of columns in it for some renewals and I asked the LLM to do this with the PDF file dropped into it:

attached is a renewal quote for a bunch of Cisco equipment. in the "description" column, the device serial numbers are listed, they start with the pattern FGL.

please create a table consisting of the following values: product code, Serial number, description, total

The serial numbers were not the only thing in the description column (it had other info as well), and the quote contained much more than just the table: disclaimer, terms, company logo, contact details, etc.

It nailed it. It even picked up serial numbers I didn't tell it about because they didn't start with FGL - it figured that out on its own:

(screenshot of the generated table)



Then:
please create a csv file of this


Also nailed it - gave me some stuff to copy/paste into a text editor.
This CSV file contains all the details from the table you requested, including:

  • Product code: The unique identifier for each product.
  • Serial number: The serial numbers of the Cisco equipment (starting with "FGL").
  • Description: A brief description of the product and its purpose.
  • Total: The total cost for each item.
You can copy this CSV data into spreadsheet software like Microsoft Excel, Google Sheets, or any other application that supports CSV files to view or edit the data further.


Could I have done all this manually? Sure, but the local LLM did it in a matter of seconds (12 seconds from PDF to table - model: deepseek-r1-distill-llama-8b). It would have taken me an hour or so and I would have made mistakes no doubt.
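If you wanted to script the same thing instead of dropping the PDF into the chat window, something like this rough sketch against LM Studio's local OpenAI-compatible server should get you close (assumes the pypdf and openai packages, the default localhost:1234 port, a placeholder file name, and whatever model id you have loaded):

# Sketch: extract the quote PDF's text and ask the local model for a CSV.
# Assumes LM Studio's server is running at the default http://localhost:1234/v1.
from openai import OpenAI
from pypdf import PdfReader

reader = PdfReader("renewal_quote.pdf")  # placeholder file name
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = (
    "Attached is a renewal quote for Cisco equipment. In the description "
    "column the device serial numbers are listed (they start with FGL). "
    "Create a CSV with these columns: product code, serial number, "
    "description, total.\n\n" + pdf_text
)

resp = client.chat.completions.create(
    model="deepseek-r1-distill-llama-8b",  # whatever model is loaded
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)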
 
OK, I'm sold!
 
oh also, that was a smaller deepseek r1 distillation of llama, running in LM Studio

it's just what I happened to be playing with at the time. I'm sure plenty of other models could have handled that.
 
Thanks! Would you say an M4 Max with 64GB RAM running 32b models would be the sweet spot for speed and accuracy, if you're aiming for 20+ tok/s?

If you're a casual user, maybe, but if you're serious and want something closer to ChatGPT-level performance on a local system, 32b is going to look like nothing in a year. Whenever we get more GPU power and more memory, the newest models fill it up. It's just like any app development: you give developers better hardware and the software becomes bloatier, heavier and more compute-intensive, sometimes for no good reason.

Some people optimistically believe models will become smaller and more efficient, but there is a limit to that. Meanwhile, models overall will keep growing in complexity to become more reliable, and they will always lag behind knowledge and information in the real world.
 

Yeah, I'm a bit in both camps.

I have optimism for local models, but I'm also a ChatGPT Plus subscriber, and what's happening over there is pretty damn impressive. Essentially, if I'm processing any of my own private data (like the example above), I do it locally; for general questions, etc., I use ChatGPT as appropriate.
 

For coding that is the most efficient way to do it: ask generic questions to GitHub Copilot (which is backed by ChatGPT and Sonnet) and keep private project details to the local model.

Can’t keep Qwen Coder in VRAM all the time because my projects and apps need the memory.
 
Just last week, I had a quote with a bunch of columns in it for some renewals and I asked the LLM to do this with the PDF file dropped into it:
How were you able to attach files for the LLM to work with?
I am using LM Studio, BTW.

Thanks!
 
FYI...

My M4 Pro Mac mini, 64GB, shows the following for the last session:

Assistant
qwen2.5-coder-32b-instruct@q8_0

5.65 tok/sec
828 tokens
1.67s to first token
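If anyone wants to reproduce those stats outside the LM Studio UI, a quick streaming timer against the local server gives roughly the same numbers (a sketch only; the model id and prompt are just examples, and the chunk count only approximates the token count):

# Sketch: time-to-first-token and tokens/sec from LM Studio's local server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen2.5-coder-32b-instruct",  # whatever model is loaded
    messages=[{"role": "user", "content": "Write a short bubble sort in Python."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.time()
        chunks += 1

print(f"time to first token: {first_token_at - start:.2f}s")
print(f"~{chunks / (time.time() - first_token_at):.2f} tok/sec over {chunks} chunks")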
 