Has anyone tried the MLX format of LLMs (such as this 4-bit version of DeepSeek-R1 Distill Llama 70B: https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit)? It is said to deliver much faster performance than the GGUF format, and the download seems much smaller than the GGUF version too. Is there a difference in accuracy?
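In case it helps anyone who wants to try it outside of LM Studio: Apple's mlx-lm package can load that repo directly. A minimal sketch, assuming `pip install mlx-lm`, an Apple Silicon Mac with enough unified memory for the 70B 4-bit weights (roughly 40 GB), and a made-up example prompt:

```python
# Minimal sketch: run the 4-bit MLX model with the mlx-lm package.
from mlx_lm import load, generate

# Downloads the weights from Hugging Face on first run (roughly 40 GB).
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit")

# Wrap the prompt in the model's chat template (the R1 distills expect it).
messages = [{"role": "user", "content": "Why is the sky blue?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# verbose=True prints the generated text plus a tokens/sec readout.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```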
No difference in accuracy. Language models are non-deterministic, so it's always hard to predict how they will respond anyway.
You can compare the MLX and GGUF versions in LM Studio.
If there is a difference in speed, it is something like half a token per second. GGUF files are already handled about as well as they can be; converting to another format isn't going to make a noticeable difference.
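If you want to put a number on it yourself, LM Studio already shows tokens/sec for the GGUF side, and you can get a rough figure for the MLX side with a quick timing loop. A sketch, again assuming mlx-lm is installed; the prompt is just a placeholder:

```python
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit")

start = time.perf_counter()
text = generate(model, tokenizer, prompt="Write a haiku about apples.", max_tokens=200)
elapsed = time.perf_counter() - start

# Rough figure only: this lumps prompt processing in with generation.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.2f} tok/s")
```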