
innerproduct

macrumors regular
Original poster
Jun 21, 2021
Even if Apple Intelligence is quite underwhelming right now, I assume they are in it for the long run. The M2 Ultra is obviously not the best fit, even if it seems quite competitive for inference on a perf/watt basis (just speculation).
So, let's assume Apple is building its rumored new server/workstation chip: what would be the most sensible approach from Apple's perspective? Something that also allows actual training?
With the known building blocks at their disposal, what could good approaches look like? Doing like AMD and Nvidia, with HBM and lots of tensor cores targeting 4-bit and 8-bit? Some more general approach?
Do “we” have any insights or ideas at all at this time?
What memory types? Interconnects? What balance between CPU, GPU and tensor cores? RT cores would clearly just occupy space for these purposes, but maybe they'd still be there, unused, to get the benefit of reuse and also deliver a high-end workstation chip?

Also, what frameworks would they be running? PyTorch is clearly broken on the Mac, and MLX is more of an experimental framework as it stands right now. CoreML, perhaps, for inference, but clearly not for training.
Let the speculation run wild 😂
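For what it's worth, MLX can already express a basic training loop today; here's a minimal sketch, assuming the current mlx / mlx.nn / mlx.optimizers Python APIs (the tiny model, data and hyperparameters are purely illustrative):

```python
# Minimal MLX training sketch (illustrative only; assumes the mlx package is installed).
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(16, 32)
        self.l2 = nn.Linear(32, 1)

    def __call__(self, x):
        return self.l2(nn.relu(self.l1(x)))

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

model = TinyMLP()
optimizer = optim.SGD(learning_rate=1e-2)
loss_and_grad = nn.value_and_grad(model, loss_fn)  # differentiate w.r.t. model parameters

x = mx.random.normal((64, 16))   # dummy inputs
y = mx.random.normal((64, 1))    # dummy targets

for step in range(100):
    loss, grads = loss_and_grad(model, x, y)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)  # force MLX's lazy evaluation
```

Whether that scales to serious multi-node training is the real question, but the basic pieces are there.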
 
For local inferencing, unified memory size is everything, and there are reliable apps, e.g. ollama built on llama.cpp, to help people do it, with GPU acceleration via the Metal API. And there are many models freely available. However, an fp16 model quantized down to Q4, Q3, Q2 or lower performs worse and hallucinates more.
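To make the llama.cpp/ollama route concrete, here's a minimal sketch using the ollama Python client (assumes `pip install ollama`, the ollama server running locally, and the model already pulled; the tag and prompt below are just placeholders):

```python
# Minimal local-inference sketch via the ollama Python client.
# On Apple Silicon, ollama/llama.cpp offloads layers to the GPU through Metal automatically.
import ollama

response = ollama.chat(
    model="llama3.3:70b-instruct-q4_K_M",  # placeholder tag: pick a quantization that fits your RAM
    messages=[{"role": "user", "content": "Explain the tradeoff between Q4 and Q6 quantization."}],
)
print(response["message"]["content"])
```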

On Apple Silicon, people pick the quantized size that fits in memory (the amount dedicated to the GPU can be raised a bit using sudo sysctl iogpu.wired_limit_mb), so the model runs 100% on the GPU (apart from the initial stage). But a lower Q means poorer output quality.
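For anyone checking their headroom before picking a quantization level, a small sketch (assumes macOS on Apple Silicon, where the iogpu.wired_limit_mb sysctl exists; actually raising the limit still needs e.g. sudo sysctl iogpu.wired_limit_mb=57344 from a shell):

```python
# Read total RAM and the current GPU wired-memory limit on Apple Silicon (read-only; no sudo needed).
import subprocess

def sysctl_int(name: str) -> int:
    out = subprocess.run(["sysctl", "-n", name], capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

total_mb = sysctl_int("hw.memsize") // (1024 * 1024)
wired_limit_mb = sysctl_int("iogpu.wired_limit_mb")  # 0 means macOS is using its default limit

print(f"Total RAM: {total_mb} MB, GPU wired limit: {wired_limit_mb} MB")
```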

For example, I can download and run the Llama 3.3 70B Q4_K model in GPU memory on a 64GB M1 Max at a reasonable number of tokens/sec, but I can't run the Q6 (never mind the Q8) model this way. I tried Q6, but it can't fit: llama3.3:70b-instruct-q6_K takes 59 GB, gets partitioned 13%/87% between CPU and GPU, and probably swaps, so it's dog slow.
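Those sizes follow from simple arithmetic: weight memory is roughly parameters × bits-per-weight ÷ 8, plus KV cache and context overhead. A quick sketch (the bits-per-weight values are approximate averages for llama.cpp K-quants):

```python
# Rough weight-memory estimate for a 70B-parameter model at different quantization levels.
PARAMS = 70e9
BITS_PER_WEIGHT = {"Q4_K": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "fp16": 16.0}  # approximate averages

for quant, bpw in BITS_PER_WEIGHT.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant:>5}: ~{gb:.0f} GB of weights (plus KV cache and runtime overhead)")
```

That puts Q4_K around 42 GB and Q6_K around 58 GB, which is why the former fits in 64GB under the GPU wired limit and the latter doesn't.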

Some people use dual AMD EPYC (CPU-only) systems to get lots of memory, but CPU inference is very slow, and NUMA probably slows a dual-socket box down even further.

Someone with a 128GB or 192GB Apple Silicon machine can run meaningfully large models locally. For local inferencing, 256GB, 512GB or 1TB of unified memory would be the way to go. Of course, more GPU cores help too. (I wish I had access to a 128GB M4 Max.) EXO can help spread a model across multiple Macs, but that's slow due to communication overhead (even over Thunderbolt 5).

As for training meaningfully large models, that's just way beyond what the average consumer or even a university-level project can do. That's for large corporations, as the cost of electricity and hardware would be mind-blowing, even with DeepSeek's optimizations. Each H800 SXM5 80GB (80GB VRAM; 528 tensor cores) costs nearly $30K, and you need a lot of them. I read somewhere that electricity runs about $2/hr per GPU (haven't checked this), and you need millions of GPU-hours.

Apple is a consumer company, and I doubt they want to go after that market. Apple has also dragged its feet with AI, which makes me think there is a faction within Apple that believes AI is a fad and that the bubble will pop when AI hits a performance ceiling. Locally run AI models (on an iPhone or iPad) will solve privacy issues, but they'll be very limited due to memory size.
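To put rough numbers on the training-cost point above, a back-of-envelope sketch (the cluster size and GPU-hour count are assumptions, loosely in the range reported for DeepSeek-V3's run; the per-GPU price and $/hr figure are the ones quoted above, unverified):

```python
# Back-of-envelope large-model training cost; every input here is a rough assumption.
num_gpus = 2048                 # assumed H800 cluster size
gpu_hours = 2.8e6               # assumed total GPU-hours for one full training run
price_per_gpu = 30_000          # ~$30K per H800 (quoted above)
cost_per_gpu_hour = 2.0         # ~$2/hr (quoted above, unverified)

hardware = num_gpus * price_per_gpu
running_cost = gpu_hours * cost_per_gpu_hour
print(f"Hardware: ~${hardware / 1e6:.0f}M up front, plus ~${running_cost / 1e6:.1f}M per training run")
```

Either way, it's tens of millions of dollars before you train anything competitive, which is the point.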
 