It’s relative to what you could do before. Technically, yes, 32bit, but if you wanted to do i8 matmul on pre-A19, that’s what you had at your disposal.

I am preparing a more detailed write-up with the results (it’s slow-going since Apple documentation is absolutely terrible and does not actually match the shipped implementation). But the preliminary findings align with 4x improvement for fp16 with fp32 accumulate and around 5.5x for int8. While nvidia tensor cores are still considerably more performant for low-precision inference, Apple should at least be competitive with Blackwell for fp16 with fp32 accumulate.
Yeah, in GB AI for my M3 I see that the Neural Engine and CPU had different results for full precision, half precision, and quantized (interestingly, for the CPU half precision was the best!) but the GPU was pretty even across the board. Now that will be different as well.
 
Yeah, in GB AI for my M3 I see that the Neural Engine and CPU had different results for full precision, half precision, and quantized (interestingly, for the CPU half precision was the best!) but the GPU was pretty even across the board. Now that will be different as well.

Not surprising — ARM SIMD has native support for packed FP16, processing 8 half-precision values per SIMD instruction. Packed int8 operations should be even faster, though there is additional overhead for scaling etc.
 
Not surprising — ARM SIMD has native support for packed FP16, processing 8 half-precision values per SIMD instruction. Packed int8 operations should be even faster, though there is additional overhead for scaling etc.
Yep, I'm working on a project using ARM NEON SIMD intrinsics, and the fact that I can access these fast SIMD registers is pretty nice. In fact, even int8 arithmetic widened to int16 can beat fp16 in some of my experiments.
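If it helps anyone, here's a minimal sketch of both of those paths with NEON intrinsics (assumes an ARMv8.2-A target with FP16 arithmetic, e.g. clang -O2 -march=armv8.2-a+fp16; the values are just placeholders): one fp16 FMA touches 8 lanes, and vmull_s8 widens 8 int8 products to int16 in a single instruction.

```c
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    /* 8 half-precision lanes per 128-bit register: one FMA updates 8 values. */
    float16x8_t a   = vdupq_n_f16((float16_t)1.5f);
    float16x8_t b   = vdupq_n_f16((float16_t)2.0f);
    float16x8_t acc = vdupq_n_f16((float16_t)0.0f);
    acc = vfmaq_f16(acc, a, b);          /* acc += a * b, 8 lanes at once */

    /* int8 path: 8 lanes of int8 * int8 widened to int16 in one instruction. */
    int8x8_t  x = vdup_n_s8(3);
    int8x8_t  y = vdup_n_s8(-4);
    int16x8_t prod = vmull_s8(x, y);     /* 8 x (int8 * int8) -> int16 */

    printf("fp16 lane0 = %f, int16 lane0 = %d\n",
           (double)vgetq_lane_f16(acc, 0), (int)vgetq_lane_s16(prod, 0));
    return 0;
}
```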
 
It’s relative to what you could do before. Technically, yes, 32bit, but if you wanted to do i8 matmul on pre-A19, that’s what you had at your disposal.

I am preparing a more detailed write-up with the results (it’s slow-going since Apple documentation is absolutely terrible and does not actually match the shipped implementation). But the preliminary findings align with 4x improvement for fp16 with fp32 accumulate and around 5.5x for int8. While nvidia tensor cores are still considerably more performant for low-precision inference, Apple should at least be competitive with Blackwell for fp16 with fp32 accumulate.
Why do you think the speed-up for int8 is only 5.5x? Could it be that Apple wrapped up the design without anticipating that LLM inference would move to Q8/Q4 so quickly?

Given that Nvidia is pretty much all in on Q8/Q4 for Blackwell and future generations, we can expect future LLMs to be predominantly Q4 inference. I suppose Apple will need to move quickly to optimize for Q4 inference on the GPU, maybe as soon as M6.
 
It’s relative to what you could do before. Technically, yes, 32bit, but if you wanted to do i8 matmul on pre-A19, that’s what you had at your disposal.

I am preparing a more detailed write-up with the results (it’s slow-going since Apple documentation is absolutely terrible and does not actually match the shipped implementation). But the preliminary findings align with 4x improvement for fp16 with fp32 accumulate and around 5.5x for int8. While nvidia tensor cores are still considerably more performant for low-precision inference, Apple should at least be competitive with Blackwell for fp16 with fp32 accumulate.
I’ve found the most recent few pages of this thread really interesting. It’s very good of you guys to take the time to share your insights.

For quite some time, I have been wondering about replacing my old dual-D700 2013 Mac Pro, which I used to use quite a bit for OpenCL programming. No urgency, as I have a nice MacBook Pro M3 Max and an Nvidia Jetson to keep me occupied (using Metal and CUDA) but, if and when such a thing exists, I could be rather tempted by an M5 Ultra system to replace the trash can — particularly if its lower-precision performance makes it a viable option for 70B+ LLMs (as well as for “old-fashioned” GPU compute stuff). So I have a couple of questions, one about these measurements done by @leman and one more general.

Firstly, if I understand correctly, these x4 FP16 and x5.5 int8 (10 TOPS) figures are for the A19’s 5-core GPU. So is it reasonable to think that an M5 Ultra with an 80-core GPU could conceivably be in the 150-200 TOPS ballpark? (With the understanding that, at this point in time, this is all highly speculative.)

Secondly, I am curious to know just how sensible it would be to consider a fully-loaded M5 system with 512GB or more memory as an alternative to an nV workstation-class GPU? (I’ve some sympathy for the view that, for local prototyping work at least, memory is just as much of a factor as raw TFLOPS/TOPS, so surely Apple’s unified memory model is quite significant?)

Thanks in advance for any thoughts (and with apologies for what are doubtless rather rudimentary questions).
 
Firstly, if I understand correctly, these x4 FP16 and x5.5 int8 (10 TOPS) figures are for the A19’s 5-core GPU. So is it reasonable to think that an M5 Ultra with an 80-core GPU could conceivably be in the 150-200 TOPS ballpark? (With the understanding that, at this point in time, this is all highly speculative.)

I've now run a more comprehensive set of tests over more matrix dimensions, and what I see is roughly

~ 1.5 TFLOPS matrix fp16 -> fp16/fp32 per GPU core
~ 2.7 TOPS matrix int8 -> int32 per GPU core

Let's scale it up to a hypothetical M5 Max or M5 Ultra and compare them with current Nvidia solutions (looking at dense matrices only)

GPU       | FP16 w/ FP32 acc | FP16 w/ FP16 acc | INT8 w/ INT32 acc
RTX 5070  | ~62 TFLOPS       | ~123 TFLOPS      | ~250 TOPS
RTX 5090  | ~210 TFLOPS      | ~420 TFLOPS      | ~840 TOPS
M5 Max    | ~65 TFLOPS       | ~65 TFLOPS       | ~110 TOPS
M5 Ultra  | ~130 TFLOPS      | ~130 TFLOPS      | ~220 TOPS
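The scaling behind the Apple rows is nothing fancier than per-core throughput multiplied by a hypothetical core count (I'm assuming 40 GPU cores for a Max and 80 for an Ultra, plus slightly higher M-class clocks). A back-of-the-envelope version in C, ignoring the clock difference, so it lands a touch below the table:

```c
#include <stdio.h>

int main(void) {
    /* Measured per-core A19 figures from above. */
    const double fp16_per_core = 1.5;   /* TFLOPS, fp16 -> fp16/fp32 */
    const double int8_per_core = 2.7;   /* TOPS,   int8 -> int32     */

    /* Hypothetical core counts; clock uplift over the A19 is ignored. */
    const struct { const char *name; int cores; } parts[] = {
        { "M5 Max (40 GPU cores)",   40 },
        { "M5 Ultra (80 GPU cores)", 80 },
    };

    for (int i = 0; i < 2; i++)
        printf("%-24s ~%3.0f TFLOPS fp16, ~%3.0f TOPS int8\n",
               parts[i].name,
               fp16_per_core * parts[i].cores,
               int8_per_core * parts[i].cores);
    return 0;
}
```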


Secondly, I am curious to know just how sensible it would be to consider a fully-loaded M5 system with 512GB or more memory as an alternative to an nV workstation-class GPU? (I’ve some sympathy for the view that, for local prototyping work at least, memory is just as much of a factor as raw TFLOPS/TOPS, so surely Apple’s unified memory model is quite significant?)

Thanks in advance for any thoughts (and with apologies for what are doubtless rather rudimentary questions).

I think at this point it is rather difficult to predict usability and performance. It will certainly be a boost for working with large models, and will tremendously improve Apple's value proposition for ML. Will it be good value though? Who knows. In some key areas (like fp16 with fp32 accumulate) Apple has reached parity with Nvidia, in some others they are behind, and reduced-precision inference (e.g. FP4/FP8) is missing entirely. I don't have any experience as an ML engineer, so it is impossible for me to say how this will interact with the higher memory capacity of Apple Silicon.

Another big factor is programmability. Nvidia is state of the art, their documentation is generally excellent, and they give you all the tools you need to build top-notch software. Apple's developer experience, on the other hand, is atrocious. Half of the tensor APIs described in the Metal 4 documentation either don't exist or crash when one tries to use them. The docs mention one set of functions; the API shipped with another. More crucially, however, it seems insanely difficult to reach good performance with the current APIs. Nvidia gives you a set of primitives which you know will be fast, and they tell you the optimal transfer-to-compute ratios, so you can build your code around these things. Apple gives you a high-level API that is supposed to be user-friendly, but that means you have to guess which matrix shapes and settings work well and which don't. These things need to get much, much better before Metal can be considered a serious target for development.
 
I've now run a more comprehensive set of tests over more matrix dimensions, and what I see is roughly

~ 1.5 TFLOPS matrix fp16 -> fp16/fp32 per GPU core
~ 2.7 TOPS matrix int8 -> int32 per GPU core

Let's scale it up to a hypothetical M5 Max or M5 Ultra and compare them with current Nvidia solutions (looking at dense matrices only)

GPU       | FP16 w/ FP32 acc | FP16 w/ FP16 acc | INT8 w/ INT32 acc
RTX 5070  | ~62 TFLOPS       | ~123 TFLOPS      | ~250 TOPS
RTX 5090  | ~210 TFLOPS      | ~420 TFLOPS      | ~840 TOPS
M5 Max    | ~65 TFLOPS       | ~65 TFLOPS       | ~110 TOPS
M5 Ultra  | ~130 TFLOPS      | ~130 TFLOPS      | ~220 TOPS




I think at this point it is rather difficult to predict usability and performance. It will certainly be a boost for working with large models, and will tremendously improve Apple's value proposition for ML. Will it be good value though? Who knows. In some key areas (like fp16 with fp32 accumulate) Apple has reached parity with Nvidia, in some others they are behind, and reduced-precision inference (e.g. FP4/FP8) is missing entirely. I don't have any experience as an ML engineer, so it is impossible for me to say how this will interact with the higher memory capacity of Apple Silicon.

Another big factor is programmability. Nvidia is state of the art, their documentation is generally excellent, and they give you all the tools you need to build top-notch software. Apple's developer experience, on the other hand, is atrocious. Half of the tensor APIs described in the Metal 4 documentation either don't exist or crash when one tries to use them. The docs mention one set of functions; the API shipped with another. More crucially, however, it seems insanely difficult to reach good performance with the current APIs. Nvidia gives you a set of primitives which you know will be fast, and they tell you the optimal transfer-to-compute ratios, so you can build your code around these things. Apple gives you a high-level API that is supposed to be user-friendly, but that means you have to guess which matrix shapes and settings work well and which don't. These things need to get much, much better before Metal can be considered a serious target for development.
Interesting and sobering. I don’t know how often Apple hears that documentation is an issue, but apparently not often enough.

Have I misremembered or did you previously think that the M5 Max would match the 5070 for FP16 with FP16 accumulate?
 
The 5090 is the equivalent of 170 Apple GPU cores, and it runs at a much higher clock. That’s tough to match.
I don’t doubt it. As you say though, depending on your use case, it may present better value. I’m curious if Apple plans on matching the high end of the desktop space, or if they’re satisfied with the high end of the laptop space.
 
I don’t doubt it. As you say though, depending on your use case, it may present better value. I’m curious if Apple plans on matching the high end of the desktop space, or if they’re satisfied with the high end of the laptop space.
The GeForce RTX 5090 has a die size of 750mm² (92.2 billion transistors on a 5 nm-class process). The current reticle limit for a monolithic die is about 858mm².

There just isn't very much wiggle room to match that with the current approach and power budget. The M2 Max is around 510mm² on 5nm with 67 billion transistors.
 
I don’t doubt it. As you say though, depending on your use case, it may present better value. I’m curious if Apple plans on matching the high end of the desktop space, or if they’re satisfied with the high end of the laptop space.

I don't think Apple views the desktop and laptop markets as separate. Unlike Intel/AMD and Nvidia, Apple isn't making desktop and mobile variants of their chips. On the CPU side, the M-series already competes with top-tier desktop chips from AMD and Intel, while using significantly less power. The only SoC from Apple one could consider to be a desktop-only part would be the Ultra variants, given their size and cooling requirements.
 
I've now run a more comprehensive set of tests over more matrix dimensions, and what I see is roughly

~ 1.5 TFLOPS matrix fp16 -> fp16/fp32 per GPU core
~ 2.7 TOPS matrix int8 -> int32 per GPU core

Let's scale it up to a hypothetical M5 Max or M5 Ultra and compare them with current Nvidia solutions (looking at dense matrices only)

GPU       | FP16 w/ FP32 acc | FP16 w/ FP16 acc | INT8 w/ INT32 acc
RTX 5070  | ~62 TFLOPS       | ~123 TFLOPS      | ~250 TOPS
RTX 5090  | ~210 TFLOPS      | ~420 TFLOPS      | ~840 TOPS
M5 Max    | ~65 TFLOPS       | ~65 TFLOPS       | ~110 TOPS
M5 Ultra  | ~130 TFLOPS      | ~130 TFLOPS      | ~220 TOPS




I think at this point it is rather difficult to predict usability and performance. It will certainly be a boost for working with large models, and will tremendously improve Apple's value proposition for ML. Will it be good value though? Who knows. In some key areas (like fp16 with fp32 accumulate) Apple has reached parity with Nvidia, in some others they are behind, and reduced-precision inference (e.g. FP4/FP8) is missing entirely. I don't have any experience as an ML engineer, so it is impossible for me to say how this will interact with the higher memory capacity of Apple Silicon.

Another big factor is programmability. Nvidia is state of the art, their documentation is generally excellent, and they give you all the tools you need to build top-notch software. Apple's developer experience, on the other hand, is atrocious. Half of the tensor APIs described in the Metal 4 documentation either don't exist or crash when one tries to use them. The docs mention one set of functions; the API shipped with another. More crucially, however, it seems insanely difficult to reach good performance with the current APIs. Nvidia gives you a set of primitives which you know will be fast, and they tell you the optimal transfer-to-compute ratios, so you can build your code around these things. Apple gives you a high-level API that is supposed to be user-friendly, but that means you have to guess which matrix shapes and settings work well and which don't. These things need to get much, much better before Metal can be considered a serious target for development.
Thank you for taking the time to make such a speedy and comprehensive reply. Much appreciated.

I’m sure you’re right, that it is impossible to say whether (in 2026/27) Apple will represent good value for ML. I suppose it’s stating the obvious to say that, in terms of raw performance (whether “traditional” FP32 or tensor-style), it simply isn’t realistic for an Mx chip that also includes a CPU to compete with a more specialised Nvidia GPU that draws much more power (and may also have a higher clock rate). And let's not even talk about FP64. That does not mean that the M5 won’t be a good option for some ML (and even HPC) workloads — I think it will be — but the devil will be in the detail.

Sadly, I agree with what you say about documentation and tooling. My experience of converting my CUDA and OpenCL code to Metal was by no means pain-free and involved rather more trial and error than I would have liked, whereas I have been pretty impressed by Nvidia's software stack. At some point, I will have to stop being indecisive and plump for either an M5 Studio or Pro with lots of memory or an Nvidia workstation. As I will ultimately need to replace my 2013 trash can for other purposes, I’d quite like to be able to have just a single Apple box for everything, but the sensible choice may turn out to be Nvidia.

(I still can’t type or think “M5” without wanting to append “… multitronics unit" …. and I don’t mean the Audi CVT …. :) )
 
Aren't the tensor cores systolic arrays?

Also, any idea or insight into whether Apple will add FP8 hardware?
A systolic array is a particular type of technology, a particular way to lay out functionality.
It is not a synonym for "matrix multiplication". There are MANY ways to perform high-speed matrix multiplication.
Apple uses very different techniques for the SME coprocessor, the GPU, NEON, and ANE.
nV has probably changed the details of how it does tensor core compute over the years.

Honestly this is not an especially interesting question. Far more important in terms of progress is offloading ever more of the dumb work surrounding matrix multiplication to autonomous hardware. For example, the BIG recent improvements in the tensor core have come from providing a local memory that feeds the tensor core, and autonomous hardware (the TMA) that moves data into/out of that memory, rather than relying on the rest of the GPU core to do that work. Apple has essentially the same sort of design in the ANE (and did it first), but on the GPU side they are where nV was five years ago, still requiring data for the Neural Accelerator part of the GPU to pass through standard GPU registers and load/store.

FP8 is also not especially interesting.
nV evolved their hardware along a path for which each step made sense - start the tensor core with FP16/FP32, then later add FP8, then a year or two after that add FP4. Each step was the easiest way to modify what existed in a minimal fashion while increasing compute. But easiest does not mean BEST...
Is FP8 the best way to balance dense compute, precision, and dynamic range? Especially given that there are at least two different versions, each trying to balance these issues? Compared to linear INT8, or even better, INT8 going through a lookup table (i.e. 8b quantized)? Almost certainly not. But adding a lookup-table path into the existing tensor core HW is tough; modifying the FP16 multipliers and adders to give you two FP8s is easier.
With FP4 it's even more stark. No-one pretends FP4 is a sensible format; a 16-element lookup table is better in every possible way -- EXCEPT for the difficulty of adding it to the existing tensor core.
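To make the lookup-table point concrete, here is a toy sketch of 4-bit palette dequantization (my own illustration, not Apple's or Nvidia's actual hardware path): the 16 table entries can be placed wherever the weight distribution wants them, whereas FP4 pins you to the values the format happens to encode.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16-entry palette: arbitrary reconstruction values,
 * e.g. chosen to match the weight distribution of a given tensor. */
static const float palette[16] = {
    -1.00f, -0.70f, -0.50f, -0.35f, -0.25f, -0.16f, -0.09f, -0.03f,
     0.03f,  0.09f,  0.16f,  0.25f,  0.35f,  0.50f,  0.70f,  1.00f
};

/* Dequantize n 4-bit codes (two per byte) into floats via the palette. */
static void dequant4_lut(const uint8_t *packed, float *out, int n, float scale)
{
    for (int i = 0; i < n; i++) {
        uint8_t byte = packed[i / 2];
        uint8_t code = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
        out[i] = scale * palette[code];
    }
}

int main(void) {
    const uint8_t packed[2] = { 0xF0, 0x58 };   /* codes: 0, 15, 8, 5 */
    float w[4];
    dequant4_lut(packed, w, 4, 2.0f);
    for (int i = 0; i < 4; i++)
        printf("w[%d] = %.3f\n", i, w[i]);
    return 0;
}
```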

nV are a very smart company; I have tremendous respect for them. But, like everyone, they are subject to the constraints of time and history. In a perfect world they'd probably love to redo the tensor core, ripping out FP4 and FP8, and replacing them with a lookup table design. But they can't do that, they have to continue to support what was added in the past, at least for a while. Just because nV do things a certain way, which means much of the ecosystem does them a certain way, doesn't mean that way is optimal. Something like their structured sparsity seems like a bust. It hangs around because it makes all their numbers look twice as large, but it's rare that you find a paper using it (at least that's been my experience).

Apple on ANE has gone a different path. This might be because they had an early vision of how this stuff would play out (tighter interaction between their SW and HW design, rather than nV trying to be everything to everyone), or perhaps they could more easily drop stuff that wasn't useful, or maybe just dumb luck?
But their model is very much more based on integer quantizations (8, 6, 4, even 3 or 2b) that route through a lookup table to either an FP16 or INT8 value which then goes through the multiplier-adders. This gets them less dense compute than nV, but the same memory bandwidth advantages, and memory bandwidth is the bigger constraint.
In place of structured sparsity, they have a scheme that allows for very low-density (say 0.1% to 10%) outliers that can be represented by a bitmap and added into a layer as required. Since this is not an nV capability, you won't see it discussed in papers, but it appears to work really well when models are modified to take advantage of it.
So the bottom line is that Apple tends to get a lot more out of their compute than nV, even though it superficially looks weaker.
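A toy software reconstruction of what such a bitmap-plus-outliers scheme could look like (again my own illustration of the idea, not Apple's documented format): the dense layer stays low precision, and a small packed list of full-precision corrections is added back wherever the bitmap has a set bit.

```c
#include <stdint.h>
#include <stdio.h>

/* Add sparse corrections to a dense weight buffer: bit i of the bitmap
 * says whether weight i has an outlier correction in the packed list. */
static void apply_outliers(float *w, int n,
                           const uint64_t *bitmap,   /* 1 bit per weight   */
                           const float *outliers)    /* packed corrections */
{
    int next = 0;
    for (int i = 0; i < n; i++)
        if ((bitmap[i / 64] >> (i % 64)) & 1u)
            w[i] += outliers[next++];   /* hit by only ~0.1%-10% of weights */
}

int main(void) {
    float w[8] = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f };
    uint64_t bitmap[1]  = { (1u << 2) | (1u << 6) };  /* outliers at 2 and 6 */
    float    outliers[] = { 3.0f, -2.5f };
    apply_outliers(w, 8, bitmap, outliers);
    for (int i = 0; i < 8; i++)
        printf("%.2f ", w[i]);
    printf("\n");
    return 0;
}
```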

They seem to like this scheme enough that they've brought elements of it to the GPU.
The part that's clear is that the GPU "Neural Accelerator" is accelerated INT8*INT8 multiplication, not FP8 support. If I'm correct, the logical things to add over the next few years are not more formats (like FP8) but better memory support, e.g. the equivalent of something like TMA to feed the Neural Accelerator rather than going through GPU registers, plus table-lookup support, and maybe even something like the sparse bitmap support to allow for separate tracking of outliers?
 
You generally need hardware support.
Does Apple GPU have stochastic rounding? Done correctly, that seems to provide remarkable improvements in, let's call it "stability", basically the ability to still get a useful value (training or inference) even while using aggressive quantization/precision.
I know ANE does provide stochastic rounding, but I don't think I've ever seen a reference to it in the GPU context. Might be something to look for and see if you can toggle?
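For anyone who wants to poke at this, a toy software emulation of stochastic rounding (shown float -> int8 for simplicity; this is just the concept, not a claim about how Apple's hardware implements it): round up or down with probability proportional to the fractional part, so the rounding error is zero in expectation.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stochastically round x to int8: choose floor(x) or floor(x)+1 with
 * probability proportional to how close x is to each, then clamp. */
static int8_t stochastic_round_int8(float x)
{
    float lo   = floorf(x);
    float frac = x - lo;                           /* in [0, 1) */
    float u    = (float)rand() / (float)RAND_MAX;  /* uniform sample */
    float r    = (u < frac) ? lo + 1.0f : lo;
    if (r >  127.0f) r =  127.0f;
    if (r < -128.0f) r = -128.0f;
    return (int8_t)r;
}

int main(void) {
    /* Round-to-nearest always maps 2.3 to 2; stochastic rounding averages
     * out to ~2.3 over many samples, which is what preserves signal at
     * aggressive precisions. */
    const int N = 100000;
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += stochastic_round_int8(2.3f);
    printf("mean of stochastic roundings: %.4f (target 2.3)\n", sum / N);
    return 0;
}
```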
 
Not surprising — ARM SIMD has native support for packed FP16, processing 8 half-precision values per SIMD instruction. Packed int8 operations should be even faster, though there is additional overhead for scaling etc.
I thought GB AI "CPU" routed to SME/AMX.
NEON won't be utterly terrible, but SME should be substantially better (like 8x or so I think).
 
I don’t doubt it. As you say though, depending on your use case, it may present better value. I’m curious if Apple plans on matching the high end of the desktop space, or if they’re satisfied with the high end of the laptop space.
Apple would argue that Compute is easy, Memory (bandwidth and capacity) is hard. And that what you get with an M5 Ultra is an easier programming model because of that larger pool of memory.
Of course how much that matters to any particular use case depends on details.

If it's important that you "track" the nV world (eg use nV-optimized libraries and algorithms) it's probably not worth the hassle.
OTOH if you're more interested in doing stuff that's actually different from the nV world (eg exploring ideas that don't work well on nV for whatever reason) then the Apple path may make sense. For example, interesting work has been done (look up the KPOP optimizer) on using a rather different optimizer from the standard ADAM on Apple HW. KPOP works for Apple because it requires a little more memory and somewhat more compute, but a lot less communication, than ADAM, so it's a better match for, say, a small cluster of Apple HW.

Another direction you can go (as always, it depends on your interests) is, rather than buying a max-end Mac, to buy 4 mid-range Macs (like Mac minis) and explore what can be done not on one machine but across a cluster (like the KPOP research).
 
Does Apple GPU have stochastic rounding? Done correctly, that seems to provide remarkable improvements in, let's call it "stability", basically the ability to still get a useful value (training or inference) even while using aggressive quantization/precision.
I know ANE does provide stochastic rounding, but I don't think I've ever seen a reference to it in the GPU context. Might be something to look for and see if you can toggle?

That's a really good question. I don't see anything in the documentation. I'll play around with some data and see if I notice any odd rounding behavior on FP16 data.

I thought GB AI "CPU" routed to SME/AMX.
NEON won't be utterly terrible, but SME should be substantially better (like 8x or so I think).

Great point! I entirely overlooked that.
 