Thanks for your informative post. I am also looking at running models around the same size, since performance deteriorates too quickly as they get larger. Can you give an estimate of how many tokens/sec you were getting out of your M4 Max running a model of that size?
I have been on the fence about getting a base Ultra or the same spec M4 Max as you (128GB/1TB SSD). When comparing an M3 Max and an M4 Pro with way fewer GPU cores, the M4 Pro was keeping up pretty well on MLX, and I was wondering if it could be the Armv9/SME difference.
I'm actually away right now, and when I get back I will be pulling the trigger on a Mac Studio - but I can't get over the fact that the M4 Max in that configuration is so close to the M3 Ultra. Any other year, I would go for the Ultra - but this year it's quite the dilemma.
If it helps, with my M3 Ultra I get 24.34 tok/sec with qwen2.5-32b-instruct Q5_K_M GGUF.
MLX would be faster but I find it less accurate.
If there is a particular model you want tested let me know.