8bit & 4bit acceleration on Apple Silicon GPU for LLMs

senttoschool · Apr 4, 2025

Some thing to discuss. Local LLM scene is growing. It causes Mac buyers to upgrade RAM, which Apple makes good margins in. Macs have unified memory and has an inherent advantage over discrete GPUs in LLM inference due to the larger capacity. However, Apple Silicon GPUs don't seem to have the compute to compete against RTX GPUs, or even AMD ones, in prompt processing and evaluation speed as long as the LLM model fits into the VRAM. This is because RTX GPUs have Tensor Core, which accelerate 8bit/4bit inference. Blackwell takes 4bit acceleration to the next level.

I hope Apple continues to invest in LLM support for their GPUs and add dedicated 8/4bit acceleration.

thenewperson · Apr 4, 2025

Fingers crossed this is work on that https://forums.macrumors.com/threads/apple-silicon-deep-learning-performance.2319673/post-33834199

leman · Apr 4, 2025

The Apple ANE (inference accelerator) appears to have hardware support for 8-bit calculations. Of course, it is not really suitable for very large language models.

My post in the other thread (linked by previous poster) describes a recently published Apple patent that appears to describe a GPU matrix computation engine not unlike the tensor core. The patent does not explicitly mention multi-precision operation, but it could be implied.

I think we can all agree that having faster GPU-driven matrix multiplication would considerably improve the Mac value for users and developers of ML models.

picpicmac · Apr 4, 2025

senttoschool said:
This is because RTX GPUs have Tensor Core

Nvidia boards primary advantage is the memory bandwidth, which is 2x to 4x that of even the M3 Ultra.

The Tensor core advantage means nothing if they can't be fed the data from memory fast enough.

The HBM memory that Nvidia uses is expensive (though Nvidia does a huge markup on their boards, then their partners do even more.)

senttoschool said:
or even AMD ones

From examples I've seen, if the LLM has to run on the AMD CPUs instead of the Nvidia board, the performance is worse that the Mac Ultra SoCs. Unless you mean the AMD graphics cards, which are getting better.

senttoschool · Apr 4, 2025

picpicmac said:
Nvidia boards primary advantage is the memory bandwidth, which is 2x to 4x that of even the M3 Ultra.

The Tensor core advantage means nothing if they can't be fed the data from memory fast enough.

Prompt processing depends on compute. Time to first token basically. I made that clear in the original post.

Think of LLM inferencing as two parts:

1. Processing the context before token generation (time to first token)

2. Actually returning tokens (tokens/s)

Bandwidth is almost always a bottleneck, but Macs are especially bad at #1.

M3 Ultra is already bottlenecked by compute more than bandwidth for Deepseek R1 672b Q4. The theoretical tokens/s is around ~40 tokens/s based on the 819GB/s bandwidth. In reality, it's 17-19 tokens/s. Compute seems to be the bottleneck here.

Accelerating 8 & 4 bit on the GPU does two important things:

1. Remove compute bottleneck for a model like Deepseek R1

2. Significantly improve prompt processing speed, which can be minutes long right now depending on the model and context.

leman · Apr 5, 2025

picpicmac said:
Nvidia boards primary advantage is the memory bandwidth, which is 2x to 4x that of even the M3 Ultra.

The Tensor core advantage means nothing if they can't be fed the data from memory fast enough.

What if you can feed the data fast enough? Nvidia does “cheat” in the benchmarks (as in they overestimate the impact of their massive compute ability on real workloads has you state, the bottleneck remains the movement of the data). Still, compute does make a difference. When working on local computations, Nvidia can process the data as quickly as it comes from the cache, while Apple is compute-bound.

Nevertheless, I am optimistic that Macs will get much better at these things in short to mid-term. The new patents explicitly focus on optimizing data movement by caching intermediate dot product results, which could enable great performance even with limited compute.

picpicmac said:
The HBM memory that Nvidia uses is expensive (though Nvidia does a huge markup on their boards, then their partners do even more.)

Consumer boards use GDDR

novagamer · Apr 5, 2025

Has anyone seen updates in this area for 15.4 now that SME should be “turned on” in M4 per https://github.com/llvm/llvm-project/pull/119171 ?

I’m not sure what thread to ask this question in, pardon me if this isn’t a good spot.

senttoschool · Apr 5, 2025

Download Llama

Request access to Llama.

www.llama.com

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

We’re introducing Llama 4 Scout and Llama 4 Maverick, the first open-weight natively multimodal models with unprecedented context support and our first built using a mixture-of-experts (MoE) architecture.

ai.meta.com

Looks to be a great model for Macs. ~55GB for Q4. 17B active params. Should yield around 30 tokens/s for an M4 Max.

Just need more compute so longer contexts aren't going to take minutes to process. 10M context window is sort of insane.

400b model should fit fully within an M3 Ultra 512GB.

picpicmac · Apr 7, 2025

leman said:
Consumer boards use GDDR

The unit fka "Sparks" and the 6000 Blackwell are advertised as having HBM as well, don't they?

picpicmac · Apr 7, 2025

senttoschool said:
The theoretical tokens/s is around ~40 tokens/s based on the 819GB/s bandwidth.

The thing we need to keep in mind about the Ultra chips is that they are multiprocessors.

Each half has four memory controllers, so the M3 like the M2 before it should be thought of as 410GB/s x 2.

The bridge between both sides is claimed by Apple to have over a TB/s of bandwidth, but remember that has to serve both ways, and it's not just for the GPU cores.

picpicmac · Apr 7, 2025

picpicmac said:
The unit fka "Sparks" and the 6000 Blackwell are advertised as having HBM as well, don't they?

Answering my own question: HBM doesn't get used until you step up to the DGX Workstation, the DGX Spark being just an overpriced little box of LPDDR.

dmccloud · Apr 7, 2025

senttoschool said:
Some thing to discuss. Local LLM scene is growing. It causes Mac buyers to upgrade RAM, which Apple makes good margins in. Macs have unified memory and has an inherent advantage over discrete GPUs in LLM inference due to the larger capacity. However, Apple Silicon GPUs don't seem to have the compute to compete against RTX GPUs, or even AMD ones, in prompt processing and evaluation speed as long as the LLM model fits into the VRAM. This is because RTX GPUs have Tensor Core, which accelerate 8bit/4bit inference. Blackwell takes 4bit acceleration to the next level.

I hope Apple continues to invest in LLM support for their GPUs and add dedicated 8/4bit acceleration.

Once you start using larger LLM models the RTX advantage quickly disappears. Alex Ziskind did some head-to head testing between a Mac Studio running the M3 Ultra and 512GB RAM against a Windows PC with a 5090 and 192GB system RAM. Once the model was to large to run entirely within the GPU, the 5090 solution quickly lost ground against the Mac Studio. This advantage remained present even if the Mac Studio was set to use the same system RAM/GPU split as the RTX 5090 machine due to how unified memory works under Mac OS.

novagamer · Apr 8, 2025

dmccloud said:
Once you start using larger LLM models the RTX advantage quickly disappears. Alex Ziskind did some head-to head testing between a Mac Studio running the M3 Ultra and 512GB RAM against a Windows PC with a 5090 and 192GB system RAM. Once the model was to large to run entirely within the GPU, the 5090 solution quickly lost ground against the Mac Studio. This advantage remained present even if the Mac Studio was set to use the same system RAM/GPU split as the RTX 5090 machine due to how unified memory works under Mac OS.

This doesn’t negate what they said, Apple does need to include much faster Matrix operations specifically in their GPU cores which hopefully is coming with M5.

Apple currently has an advantage due to the unified memory architecture but it isn’t nearly as fast as other solutions when models fit within graphics memory. Their application of NPU is also (imo) abysmal at the high-end of their line given they don’t scale it with any chip other than the Ultra and that’s only because they’re using 2 dies. My iPhone should not have a faster NPU than my $7,000 laptop.

I think they’ll rectify this but we will see, if they hold out to M6 / M7 that would be a colossal mistake and they will lose any advantage they have in this space because AMD is certainly going to bring their unified memory architecture to desktops unless the AI bubble bursts very soon which seems unlikely (I expect 5 years or so, not 2-3).

If SoIC is real and coming with M5 I think a lot of stuff may change, but that’s also me projecting my own wants / needs.

senttoschool · Apr 11, 2025

This guy hooked up 3x M4 Max 128GB Studios together vs 1x M3 Ultra 512GB for LLMs.

leman · Apr 11, 2025

novagamer said:
Their application of NPU is also (imo) abysmal at the high-end of their line given they don’t scale it with any chip other than the Ultra and that’s only because they’re using 2 dies. My iPhone should not have a faster NPU than my $7,000 laptop.

Just a quick comment on this: the NPU is primarily there to support various functionality of the phone and the OS (e.g. camera, FaceID, FaceTime, photo, text summaries). I doubt the NPU is positioned as THE solution to all ML inference needs. If anything, your iPhone is using the NPU more than your laptop.

I'd rather see the die area be invested in GPU ML, to be honest.

novagamer · Apr 11, 2025

leman said:
Just a quick comment on this: the NPU is primarily there to support various functionality of the phone and the OS (e.g. camera, FaceID, FaceTime, photo, text summaries). I doubt the NPU is positioned as THE solution to all ML inference needs. If anything, your iPhone is using the NPU more than your laptop.

I'd rather see the die area be invested in GPU ML, to be honest.

Yeah, I agree die area would be better fit for the GPU since you can make use of them more generally and their first priority should be fixing some of the shortcomings with those cores, then increasing the amount of therm.

I do think there’s some capability / potential for NPU scaling if they’d offer it but I also would not make that tradeoff either. There’s kind o a chicken / egg problem where if you don’t have a scalable NPU you’ll never get cutting-edge features for it that make good use of what could be a much more powerful one.

From one perspective, like with the summaries (that I don’t love), they know they can target the entire line and have a performance baseline. From another, if they had much more powerful hardware to work with on the higher end Pro / Max chips, which still sell in pretty high volume, perhaps more complex features could be developed to target those models.

Then again Apple shipped an entire FPGA card and never let anyone actually, you know, program it. My expectations may be too high with that sort of culture in mind, but I hope it’s changing given the recent staff shakeups.

leman · Apr 11, 2025

novagamer said:
Yeah, I agree die area would be better fit for the GPU since you can make use of them more generally and their first priority should be fixing some of the shortcomings with those cores, then increasing the amount of therm.

I do think there’s some capability / potential for NPU scaling if they’d offer it but I also would not make that tradeoff either. There’s kind o a chicken / egg problem where if you don’t have a scalable NPU you’ll never get cutting-edge features for it that make good use of what could be a much more powerful one.

From one perspective, like with the summaries (that I don’t love), they know they can target the entire line and have a performance baseline. From another, if they had much more powerful hardware to work with on the higher end Pro / Max chips, which still sell in pretty high volume, perhaps more complex features could be developed to target those models.

Then again Apple shipped an entire FPGA card and never let anyone actually, you know, program it. My expectations may be too high with that sort of culture in mind, but I hope it’s changing given the recent staff shakeups.

To clarify, I think that there is place for both the NPU and ML accelerators on the Mac, and I am fine with the NPU remaining relatively small. There is a good division of labour between the two.

On the other hand, I am not super sure about the utility of CPU blocks such as AMX/SME. They are great for accelerating linear algebra routines for scientific computing, but I just don't see them as a very interesting tool for ML unless they are made much much larger.

izzy0242mr · Apr 12, 2025

senttoschool said:
Prompt processing depends on compute. Time to first token basically. I made that clear in the original post.

Think of LLM inferencing as two parts:

1. Processing the context before token generation (time to first token)

2. Actually returning tokens (tokens/s)

Bandwidth is almost always a bottleneck, but Macs are especially bad at #1.

M3 Ultra is already bottlenecked by compute more than bandwidth for Deepseek R1 672b Q4. The theoretical tokens/s is around ~40 tokens/s based on the 819GB/s bandwidth. In reality, it's 17-19 tokens/s. Compute seems to be the bottleneck here.

Accelerating 8 & 4 bit on the GPU does two important things:

1. Remove compute bottleneck for a model like Deepseek R1

2. Significantly improve prompt processing speed, which can be minutes long right now depending on the model and context.

Would you recommend against a Mac for running LLMs locally until/unless this is fixed? Or do you think it's still worth it potentially? And would it only make sense with a maxed out Ultra chip or would a Max model with enough RAM be potentially worthwhile?

senttoschool · Apr 12, 2025

izzy0242mr said:
Would you recommend against a Mac for running LLMs locally until/unless this is fixed? Or do you think it's still worth it potentially? And would it only make sense with a maxed out Ultra chip or would a Max model with enough RAM be potentially worthwhile?

It's worth it potentially right now. It's the ONLY option for very large local LLMs at any decent speed.

But it can be so much better with more 8/4bit acceleration.

tenthousandthings · Apr 14, 2025

picpicmac said:
Answering my own question: HBM doesn't get used until you step up to the DGX Workstation, the DGX Spark being just an overpriced little box of LPDDR.

And even that is a combination, with everything but the Blackwell GPU carrying LPDDR:

One interesting thing about Nvidia’s marketing is how it portrays NVLink-C2C. This is their implementation(s) of TSMC CoWoS advanced packaging. They are understandably vague about what exactly they are doing with it, and the mix of technologies, processes, and generations varies depending on the product.

For example, I could definitely be wrong, but I think the 900 GB/s NVLink-C2C spec provided in the image above indicates the new (not yet available) DGX Station incorporates fourth-generation NVLink, and not the latest-and-greatest fifth-generation.

My guess is Apple will take the DGX Spark initiative seriously, as it were, and M5 Pro/Max/Ultra will reflect that. It encroaches on the Mac Studio in a very limited way, but nonetheless, it’s a shot across the bow. I can’t see Apple (and/or AMD) backing down. Nobody but Nvidia wants another proprietary CUDA situation.

8bit & 4bit acceleration on Apple Silicon GPU for LLMs

macrumors 68030

macrumors 65816

macrumors Core

macrumors 68000

macrumors 68030

macrumors Core

macrumors 6502

macrumors 68030

macrumors 68000

macrumors 68000

macrumors 68000

macrumors 68040

macrumors 6502

macrumors 68030

macrumors Core

macrumors 6502

macrumors Core

macrumors 6502a

macrumors 68030

Contributor

Our Staff