Thanks for your informative post. I am also looking at running models around the same size, since as they get larger the performance deteriorates too quickly. Can you give an estimate of how many tokens/sec you were getting out of your M4 Max running the same size model?
I have been on the fence between getting a base Ultra or the same-spec M4 Max as you (128GB/1TB SSD). When comparing an M3 Max and an M4 Pro with far fewer GPU cores, the M4 Pro was keeping up pretty well on MLX, and I was wondering if it could be the Armv9/SME difference.
I'm actually away right now, and when I get back I will be pulling the trigger on a Mac Studio - but I can't get over the fact that the M4 Max in that configuration is so close to the M3 Ultra. Any other year, I would go for the Ultra - but this year it's quite the dilemma.
I need to pull up my old code base that runs a timer, since this QwQ-32B setup doesn't take a verbose parameter, so I can't see the actual tokens/s (there's a rough sketch of that timer at the end of this post). But I'll say the older M3 Max I had with 64GB memory was not fast but not too slow - more like "annoying" - that one ran around 10 T/s down to about 8 T/s. Anything below 10 T/s is just annoyingly slow to me. Does it still work? Yes, of course, but it's too slow for even my kids to query. The M4 Max is much faster, but let me get all my old stuff and put it back on this new machine first. I also picked up a new base MBP (12c/16c, 512GB) since I no longer use the M3 Max laptop (sold it off!) - the new machine is just a dream. Let me digress:
Apple's M4 lineup is just so awesome. I had my M1 Pro and M1 Max (14/16 inch) - the 14-inch M1 Pro had fan noise at basic tasks, so I upped to a 16-inch, then used it for a long while SITTING ON A DESK, PLUGGED IN, FOREVER. I waited until the M3 Max 16-inch and thought, this is the one! Then it spent 18 months on a desk, lol, yet again. I bought a 13.6" M3 and let me tell you, the size and weight are just so great. I used that laptop more than my M3 Max when I was around the house, outdoors getting my kids, etc. Recently I sold the M3 Max because it was just plugged in all the time and I wanted a desktop. The new Mac Studios were out and I just wanted more power in a Mac, so I ordered the base config M3 Ultra, but decided NO after some LLM tests, which showed it just doesn't have enough compute at this moment in time; even with such a huge 512GB memory upgrade, it makes no sense to use a Mac for that when the inference is just terribly slow.
Anyway, I decided on the M4 Max with 128GB memory because on my 64GB M3 Max, even this QwQ-32B 4-bit ate so much memory that I ran out of room for all the other tasks on the laptop. It was actually struggling! But now the 128GB has enough room for a small LLM and all the tasks I need to run in the background (Xcode, Visual Studio Code, Affinity Photo, etc.).
MLX is well optimized for sure; that's why I'm getting a much better inference response with this small LLM. The M4 is new, so the SME extension is now included and gets used automatically with updated compilers, etc. The M3 and prior chips have the older AMX instead.
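If you want to check whether the chip actually reports SME, something like this should do it - just a quick sketch, assuming macOS exposes the usual hw.optional.arm.FEAT_* sysctl flags (I haven't double-checked the exact key name):

```python
import subprocess

# Assumption: the flag is hw.optional.arm.FEAT_SME; it should print 1 on
# M4-class chips and be absent (or 0) on M3 and earlier.
result = subprocess.run(
    ["sysctl", "hw.optional.arm.FEAT_SME"],
    capture_output=True, text=True,
)
print(result.stdout.strip() or result.stderr.strip())
```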
Honestly, this is an expensive purchase for me to justify (for a Mac system). With a PC you can at least game, and I've got 2x 3090 GPUs, which is great, but man, the heat and the electricity are just not good. My small 100 sq ft gaming room heats up so much that I sweat when playing CoD. Ridiculous. But that's PC for ya. I don't use iPads anymore, so I opted for a desktop this time around and I'm going to keep it for a long while. It's the perfect desktop machine for me because it's so efficient and quiet and just does everything I want plus more. The AI/ML engineering stuff I do still run on the PC for the sheer compute power, but the Mac Studio is just amazing for everything that isn't AI.
The 14-inch 12c MBP is so good; the size is perfect for me. The MBAir is a better form factor, but going up to a 14-inch isn't that bad for me. So if you're on the fence about a Mac Studio, get the M4 Max, even the base config, but up the RAM. No one needs 128GB of memory for day-to-day stuff; it's this niche AI/LLM work that requires it. The M3 Ultra has its place, but man, the price does not work for me, and its older architecture feels dated.
I'll get a real tokens/s number soon. The M5 will probably be a slight upgrade, but until they really shrink the node (maybe on the M6), I don't know if you're going to see performance gains as substantial as M3 -> M4 again. The single-core performance alone is worth it: not many apps use multiple cores well, so single-core performance is the major factor in how snappy everything runs.
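In the meantime, this is roughly the timer I mean - a minimal sketch assuming mlx_lm as the runner, with a placeholder model path:

```python
import time
from mlx_lm import load, generate

# Assumption: mlx_lm is the runner; the model path below is just a placeholder.
model, tokenizer = load("mlx-community/QwQ-32B-4bit")

prompt = "Summarize why unified memory matters for local LLMs."
start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Rough tokens/sec: token count of the completion divided by wall time.
# (Includes prompt processing, so it slightly undercounts pure generation speed.)
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```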