Many of you might find this interesting as an overview of what's present in Hopper matrix multiply.
https://www.aleksagordic.com/blog/matmul
Some of you will recognize, as the article moves to the highest-performing kernels, the elements I've suggested are still missing from the A19 - dedicated TMA, async DMA to fill it, the ability to bypass GPU registers, things like that.

Of particular interest to @leman I draw attention to figure 39, which strongly suggests that, as of Hopper anyway (maybe earlier), they have pivoted from dot product HW to outer product HW.

Thanks for this, it looks very informative and detailed! I’ll make sure to study it in more depth over the next few days.

I’ve also been reading the Hopper PTX documentation after our earlier conversation, and I now believe I understand better what you are saying. This new architecture appears to combine the earlier tensor cores with SME-like tile storage. The way I understand it is that the tensor units still use dot-product hardware and take inputs from registers, but the output goes into dedicated tile storage over separate data lanes. It’s very similar to n-way SME outer products, but even wider. Definitely seems like a smart way to approach this.
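To make the dot-product vs. outer-product distinction concrete, here is a toy scalar model in plain C++ (purely illustrative, and not a claim about how either vendor's datapath is actually wired). Both routines compute the same product; the dot-product form produces one output element at a time via a K-long reduction, while the outer-product form consumes one column of A and one row of B per step and updates the whole accumulator tile, which is why it pairs naturally with dedicated tile storage.

```cpp
#include <cassert>

// Toy model: C (MxN) = A (MxK) * B (KxN), with tiny fixed sizes.
constexpr int M = 4, N = 4, K = 4;

// Dot-product formulation: each C[i][j] is an independent K-long reduction.
void matmul_dot(float A[M][K], float B[K][N], float C[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i][k] * B[k][j];     // one MAC lane per output element
            C[i][j] = acc;
        }
}

// Outer-product formulation: step k consumes column k of A and row k of B
// and updates the entire accumulator tile (the role of dedicated tile storage).
void matmul_outer(float A[M][K], float B[K][N], float C[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            C[i][j] = 0.0f;
    for (int k = 0; k < K; ++k)               // K rank-1 updates
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j)
                C[i][j] += A[i][k] * B[k][j]; // whole tile touched every step
}

int main() {
    float A[M][K], B[K][N], C1[M][N], C2[M][N];
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) A[i][k] = float(i + k);
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j) B[k][j] = float(k - j);
    matmul_dot(A, B, C1);
    matmul_outer(A, B, C2);
    for (int i = 0; i < M; ++i)               // same result, different schedule
        for (int j = 0; j < N; ++j) assert(C1[i][j] == C2[i][j]);
}
```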
 
Systolic array is a particular type of technology, a particular way to layout functionality.
It is not a synonym for "matrix multiplication". There are MANY ways to perform high speed matrix multiply.
Apple uses very different techniques for the SME coprocessor, the GPU, NEON, and ANE.
nV has probably changed the details of how it does tensor core compute over the years.

Honestly this is not an especially interesting question. Far more important in terms of progress is offloading ever more of the dumb work surrounding matrix multiplication to autonomous hardware. For example the BIG recent improvements in tensor core have come from providing a local memory (TMA) that feeds the tensor core, and autonomous hardware that moves data into/out of the TMA, rather than relying on the rest of the GPU core to do that work. Apple has essentially the same sort of design in ANE (and did it first), but on the GPU side they are where nV was five years ago, still requiring data for the Neural Accelerator part of the GPU to pass through standard GPU registers and load/store.
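As a software analogy only (not a description of NVIDIA's or Apple's actual hardware), that "autonomous mover feeding local staging memory" pattern is classic double buffering: while the compute units chew on tile k, an independent agent is already copying tile k+1 into the other buffer. A minimal C++ sketch with std::async standing in for the copy engine:

```cpp
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// Stand-ins: "global memory" tiles and a compute step over one staged tile.
using Tile = std::vector<float>;

Tile load_tile(int k) {                      // models the autonomous copy engine
    return Tile(1024, static_cast<float>(k));
}

float compute(const Tile& t) {               // models the matrix-unit work
    return std::accumulate(t.begin(), t.end(), 0.0f);
}

int main() {
    const int num_tiles = 8;
    float result = 0.0f;

    // Prime the pipeline: start copying tile 0 "in the background".
    std::future<Tile> next = std::async(std::launch::async, load_tile, 0);

    for (int k = 0; k < num_tiles; ++k) {
        Tile current = next.get();           // wait for the staged tile
        if (k + 1 < num_tiles)               // kick off the next copy first...
            next = std::async(std::launch::async, load_tile, k + 1);
        result += compute(current);          // ...so compute overlaps that copy
    }
    std::printf("%f\n", result);
}
```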

FP8 is also not especially interesting.
nV evolved their hardware along a path for which each step made sense - start tensor core as FP32/FP16, then next year add FP8, then a year or two later add FP4. Each step was the easiest way to modify what existed in a minimal fashion while increasing compute. But easiest does not mean BEST...
Is FP8 the best way to balance dense compute, precision, and dynamic range? Especially given that there are at least two different FP8 variants, each trying to balance these issues? Compared to linear INT8, or, even better, INT8 going through a lookup table (i.e. 8b quantized)? Almost certainly not. But adding a lookup-table path into the existing tensor core HW is tough; modifying the FP16 multipliers and adders to give you two FP8s is easier.
With FP4 it's even more stark. No-one pretends FP4 is a sensible format; a 16 element lookup table is better in every possible way -- EXCEPT difficulty of adding to the existing tensor core.
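To illustrate what the 16-entry lookup-table alternative amounts to (a generic C++ sketch; the codebook values below are invented for the example and are not any shipping format): each 4-bit code indexes an arbitrary reconstruction value chosen per block, whereas FP4 hard-wires one fixed, very coarse set of representable values.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// 4-bit weights, two per byte, decoded through a 16-entry table per block.
// The table entries are free parameters (e.g. fitted to the weight
// distribution), which is flexibility a fixed FP4 encoding cannot offer.
struct LutBlock4 {
    std::array<float, 16> table;   // codebook: 16 reconstruction values
    std::vector<uint8_t>  codes;   // packed 4-bit indices, two per byte
};

std::vector<float> dequantize(const LutBlock4& b) {
    std::vector<float> out;
    out.reserve(b.codes.size() * 2);
    for (uint8_t byte : b.codes) {
        out.push_back(b.table[byte & 0x0F]);        // low nibble
        out.push_back(b.table[(byte >> 4) & 0x0F]); // high nibble
    }
    return out;
}

int main() {
    LutBlock4 b;
    for (int i = 0; i < 16; ++i)                    // example codebook: uniform in [-1, 1]
        b.table[i] = -1.0f + i * (2.0f / 15.0f);
    b.codes = {0x0F, 0x7A};                         // four packed 4-bit codes
    std::vector<float> w = dequantize(b);           // yields 4 float weights
}
```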

nV are a very smart company; I have tremendous respect for them. But, like everyone, they are subject to the constraints of time and history. In a perfect world they'd probably love to redo the tensor core, ripping out FP4 and FP8, and replacing them with a lookup table design. But they can't do that, they have to continue to support what was added in the past, at least for a while. Just because nV do things a certain way, which means much of the ecosystem does them a certain way, doesn't mean that way is optimal. Something like their structured sparsity seems like a bust. It hangs around because it makes all their numbers look twice as large, but it's rare that you find a paper using it (at least that's been my experience).

Apple on ANE has gone a different path. This might be because they had an early vision of how this stuff would play out (tighter interaction between their SW and HW design, rather than nV trying to be everything to everyone), or perhaps they could more easily drop stuff that wasn't useful, or maybe just dumb luck?
But their model is very much more based on integer quantizations (8, 6, 4, even 3 or 2b) that route through a lookup table to either an FP16 or INT8 value which then goes through the multiplier-adders. This gets them less dense compute than nV, but the same memory bandwidth advantages, and memory bandwidth is the bigger constraint.
Replacing structured sparsity, they have a scheme that allows for very low density (say .1% to 10%) outliers that can be represented by a bitmap and added into a layer as required. Since this is not an nV capability, you won't see it discussed in papers, but it appears to work really well when models are modified to take advantage of it.
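Since the exact scheme isn't publicly documented, here is only a generic sketch of the bitmap-outlier idea as described above (C++, with all names and layout invented for illustration): most weights live in a low-precision base, and the rare outliers get one bit in a bitmap plus a densely stored high-precision correction that is added back in.

```cpp
#include <cstdint>
#include <vector>

// Generic sketch of a bitmap-outlier scheme (not Apple's actual format):
// most weights live in a low-precision base tensor; the few weights that
// quantize badly get a 1 bit in a bitmap plus a high-precision correction
// stored densely, in index order, and added back at reconstruction time.
struct OutlierLayer {
    std::vector<float>    base;        // dequantized low-precision weights
    std::vector<uint64_t> bitmap;      // one bit per weight: has an outlier?
    std::vector<float>    corrections; // one entry per set bit, in order
};

std::vector<float> reconstruct(const OutlierLayer& l) {
    std::vector<float> w = l.base;
    size_t next = 0;                             // index into corrections
    for (size_t i = 0; i < w.size(); ++i) {
        uint64_t word = l.bitmap[i / 64];
        if ((word >> (i % 64)) & 1u)
            w[i] += l.corrections[next++];       // add the outlier back in
    }
    return w;
}

int main() {
    OutlierLayer l;
    l.base = std::vector<float>(8, 0.25f);       // low-precision baseline
    l.bitmap = {0b00100010ull};                  // weights 1 and 5 are outliers
    l.corrections = {3.0f, -2.0f};               // applied in index order
    std::vector<float> w = reconstruct(l);       // w[1] == 3.25f, w[5] == -1.75f
}
```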
So bottom line is Apple tends to get a lot more out of their compute than nV, even though it superficially looks weaker.

They seem to like this scheme enough that they've brought elements of it to the GPU.
The part that's clear is that the GPU "Neural Accelerator" provides accelerated INT8*INT8 multiplication, not FP8 support. If I'm correct, the logical things to add over the next few years are not more formats (like FP8) but better memory support, e.g. the equivalent of something like TMA to feed the Neural Accelerator rather than going through GPU registers, plus table-lookup support, maybe even something like the sparse bitmap support to allow for separate tracking of outliers?
Not interesting to you maybe...from a computer arithmetic and numerical analysis perspective it actually is quite interesting. I wouldn't go around telling other people what their interests are.
 
I have been interested to see both Apple’s claim, reflected in MacRumors’ recent M4/M5 article, that the M5 has “4x+ peak GPU compute performance for AI” (which is, I appreciate, a marketing statement) and, more particularly, @leman’s FP16 measurements for the A19 5-core GPU. I now see that there are starting to be some Geekbench AI numbers.

For example, it looks like, for half precision, the M5 (in the MacBook Pro) is achieving something over 41,000 for “Core ML Neural Engine” and just under 25,000 for “Core ML GPU”, versus around 36,000 and 11,000 for the M4.

Does this suggest that Geekbench AI may not be fully representing the more significant gains that the M5 does seem to have made in specific areas such as FP16? At any rate, Apple presumably used something other than these particular aspects of Geekbench AI to produce the 4x claim.

(I appreciate that not all workloads will benefit from the M5’s biggest gains but, to the limited extent that I can judge without actually trying it for myself, it does seem to me that the M5 will deliver a pretty substantial boost to Apple’s AI proposition, albeit with reservations about other aspects.)
 
The Geekbench AI results appear to be an oddity. They are consistent with the general shader core improvements (2x FP16), but not with the Neural Accelerators (which should be 4x FP16 and ~8x INT8).
I did ask JFPoole if any of his benchmarks would need an update to take advantage of the Neural Accelerators. Not surprisingly, he didn’t answer.
 
The Geekbench AI results appear to be an oddity. They are consistent with the general shader core improvements (2x FP16), but not with the Neural Accelerators (which should be 4x FP16 and ~8x INT8).
Thanks for the speedy reply. That's pretty much what I figured. I suppose it's conceivable that Geekbench AI does not make use of Core ML (assuming that's what their "Core ML" benchmark does) in a way that uses FP16? (Their Core ML figure seems to be a little over 2x that of the M4, but obviously not 4x.) I'd be interested to know what workload Apple measured to get their 4x figure, though.

Being parochial, with respect to some of the ways in which I use GPUs, I now just need to figure out a cute way of using MTLTensor to do uint64 arithmetic .... :oops:

(And I share @OptimusGrime 's lack of surprise at the absence of an answer to the Neural Accelerator question. I'll be interested to take a look at the N.A. implementation that @senttoschool posted, though.)
 

Per this, MLX isn't yet updated to fully leverage the M5 design, so that may account for some of the discrepancies between testing and real world usage.
 

Per this, MLX isn't yet updated to fully leverage the M5 design, so that may account for some of the discrepancies between testing and real world usage.
 
Being parochial, with respect to some of the ways in which I use GPUs, I now just need to figure out a cute way of using MTLTensor to do uint64 arithmetic .... :oops:

Sounds expensive. What's your use case? I'd wager that using native 64-bit support is probably considerably faster than trying to build partial products from 8-bit chunks on the MXU.
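For context on why that would be expensive: rebuilding just the low 64 bits of a 64x64-bit product from 8-bit chunks already takes 36 shifted 8x8 partial products plus the additions to combine them. A small C++ check of the decomposition (plain scalar code, nothing MXU-specific):

```cpp
#include <cassert>
#include <cstdint>

// Low 64 bits of a*b rebuilt from 8-bit chunks: a = sum a_i * 2^(8i),
// b = sum b_j * 2^(8j), so a*b = sum_{i,j} a_i * b_j * 2^(8(i+j)).
// Only terms with i + j <= 7 can touch the low 64 bits.
uint64_t mul64_from_u8_chunks(uint64_t a, uint64_t b) {
    uint8_t ac[8], bc[8];
    for (int i = 0; i < 8; ++i) {
        ac[i] = static_cast<uint8_t>(a >> (8 * i));
        bc[i] = static_cast<uint8_t>(b >> (8 * i));
    }
    uint64_t result = 0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; i + j < 8; ++j) {
            uint64_t partial = static_cast<uint64_t>(ac[i]) * bc[j]; // 16-bit product
            result += partial << (8 * (i + j));   // accumulate shifted partials mod 2^64
        }
    return result;
}

int main() {
    uint64_t a = 0x1234567890ABCDEFULL, b = 0xFEDCBA0987654321ULL;
    assert(mul64_from_u8_chunks(a, b) == a * b);  // both are taken mod 2^64
}
```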
 
Sounds expensive. What's your use case? I'd wager that using native 64-bit support is probably considerably faster than trying to build partial products from 8-bit chunks on the MXU.
Apologies for not having replied sooner.

I’m sure you’re correct. I haven’t examined the compiler output, but I have always assumed that 64-bit integer operations are achieved by operating on 32-bit chunks. So it would be extreme vanity for me to think I could even come close to that — and certainly not by using tensor operations (that was a bit of a tongue in cheek remark, I’m afraid, sorry!).

The two things that make me interested in 64-bit (and longer) integer and bitwise operations are modular arithmetic on 64-bit integers (in order to do number-theoretic transforms — effectively the modular arithmetic analogue of FFTs) and addition / multiplication of what are, loosely speaking, binary matrices (hence the relevance of bitwise operations).
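For anyone curious, the expensive primitive in such an NTT is the 64-bit modular multiply; the butterfly itself is just one modular multiply by a twiddle factor plus a modular add and subtract. A rough C++ sketch (using the GCC/Clang unsigned __int128 extension for the 128-bit intermediate; a GPU version would have to build that from 32-bit limbs, which is exactly where the int32 throughput question comes in):

```cpp
#include <cstdint>

// Modular arithmetic modulo a 64-bit NTT-friendly prime; p = 2^64 - 2^32 + 1
// is used here as a well-known example, but any NTT prime works the same way.
constexpr uint64_t P = 0xFFFFFFFF00000001ULL;

// 64x64 -> 128-bit multiply, then reduce. unsigned __int128 is a GCC/Clang
// extension; a GPU port would build this from 32-bit limbs instead.
uint64_t mulmod(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) % P);
}

uint64_t addmod(uint64_t a, uint64_t b) {
    uint64_t s = a + b;
    if (s < a || s >= P) s -= P;   // fix up 64-bit wraparound or s >= P
    return s;
}

uint64_t submod(uint64_t a, uint64_t b) {
    return a >= b ? a - b : a + P - b;
}

// One radix-2 NTT butterfly: (x, y) -> (x + w*y, x - w*y) mod P.
void butterfly(uint64_t& x, uint64_t& y, uint64_t w) {
    uint64_t t = mulmod(w, y);
    uint64_t u = x;
    x = addmod(u, t);
    y = submod(u, t);
}

int main() {
    // Placeholder operands; in a real NTT, w would be a power of a primitive
    // root of unity mod P computed during setup.
    uint64_t x = 123456789ULL, y = 987654321ULL, w = 5ULL;
    butterfly(x, y, w);
}
```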

Thus far, I have been content to use uint64 and to simply let the compiler do its stuff. That works fine, but I have been wondering whether I should try something different. I’m interested to see that, in the Blackwell generation of GPUs, Nvidia have switched to having ‘unified’ shader cores that are all capable of performing either int32 or FP32 operations in a given clock cycle (whereas previously, only half of the cores were int32-capable). I have, admittedly, not had time to do a huge amount of searching, but I haven’t found any detail of how Apple Silicon performs for int32 or bitwise, relative to FP32. I suppose I just need to pony up for a Blackwell-based system at some point but, right now, I’d struggle to justify that to myself, given that I’d also like to replace my 2013 Mac Pro with an M5 machine in the not too distant future.

With further apologies for going rather hopelessly off-topic.
 
I have, admittedly, not had time to do a huge amount of searching, but I haven’t found any detail of how Apple Silicon performs for int32 or bitwise, relative to FP32. I suppose I just need to pony up for a Blackwell-based system at some point but, right now, I’d struggle to justify that to myself, given that I’d also like to replace my 2013 Mac Pro with an M5 machine in the not too distant future.

I think I might be able to help with that, at least a little bit :)

- Apple Silicon GPUs have a dedicated int32 multiply-accumulate data path that executes MAC instructions
- Before A19/M5 the throughput was 8 32-bit MACs/cycle (a quarter of the fp32 FMA rate); A19/M5 upgraded this to 16 32-bit MACs/cycle (half of the fp32 FMA rate)
- Integer addition is done at full SIMD width of 32 32-bit ADD/cycle
- I have not tested 64-bit multiplication, but I assume it is done via two MACs, so the throughput should be 8/cycle on A19/M5
- M3 and higher are able to concurrently execute int and fp instructions (dual issue). You are still limited to a single int32 instruction.
- I have not tested bitwise operations, but popcnt has the same throughput as MAC
 
I think I might be able to help with that, at least a little bit :)

- Apple Silicon GPUs have a dedicated int32 multiply-accumulate data path that executes MAC instructions
- Before A19/M5 the throughput was 8 32-bit MACs/cycle (a quarter of the fp32 FMA rate); A19/M5 upgraded this to 16 32-bit MACs/cycle (half of the fp32 FMA rate)
- Integer addition is done at full SIMD width of 32 32-bit ADD/cycle
- I have not tested 64-bit multiplication, but I assume it is done via two MACs, so the throughput should be 8/cycle on A19/M5
- M3 and higher are able to concurrently execute int and fp instructions (dual issue). You are still limited to a single int32 instruction.
- I have not tested bitwise operations, but popcnt has the same throughput as MAC
That really is most helpful. Thank you so much for taking the time to reply so quickly.

At the risk of pushing my luck, I wonder if I might ask another question:

Is there some documentation that provides this level of information for Apple Silicon?

(It’s perfectly possible that I haven’t looked closely enough or am simply being dim and have missed something obvious. I suppose another approach might be to determine threadExecutionWidth for particular operations.)
 
At the risk of pushing my luck, I wonder if I might ask another question:

Is there some documentation that provides this level of information for Apple Silicon?

Not that I am aware of.

What I write is based on my own microbenchmarking and analysis. I haven’t published this stuff anywhere. I’m quite confident about the results, since these shaders are very simple, but there are substantial limits to what can be tested easily.
 
Not that I am aware of.

What I write is based on my own microbenchmarking and analysis. I haven’t published this stuff anywhere. I’m quite confident about the results, since these shaders are very simple, but there are substantial limits to what can be tested easily.
Thank you again. Given some of the measurements that you posted a few weeks ago, I did actually wonder whether that might be the case. Good for you making the effort to do such testing; I do appreciate that it’s not a trivial matter to do so accurately and easily, as you say, not least because of the need to avoid any use of the cache (never mind main memory).

The absence of any proper documentation from Apple about these sorts of aspects of their chip architecture and performance is one of the things that I find slightly disappointing, particularly as it would be so easy for them to produce it.
 
Thank you again. Given some of the measurements that you posted a few weeks ago, I did actually wonder whether that might be the case. Good for you making the effort to do such testing; I do appreciate that it’s not a trivial matter to do so accurately and easily, as you say, not least because of the need to avoid any use of the cache (never mind main memory).

For arithmetic microbenchmarks, not hitting memory is fairly easy; the bigger problem is fighting the compiler and undefined behavior. Since we don't have access to a lower-level programming interface (even the LLVM-based AIR code goes through an optimizer pass before being converted to the final binary), one sometimes has to jump through weird hoops to avoid the code being hoisted or eliminated outright. Luckily, these moments are easy to spot.
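For anyone wondering what those hoops look like, the usual trick is roughly the following (a generic Metal Shading Language sketch, not the actual test harness): make the loop a long chain of data-dependent operations seeded from a buffer the compiler can't see through, and write the final value out so the whole thing can't be hoisted or dead-code-eliminated.

```cpp
#include <metal_stdlib>
using namespace metal;

// Generic throughput probe: a long chain of data-dependent 32-bit MACs.
// The seed comes from a buffer (unknown at compile time) and the result is
// written back, so the compiler cannot constant-fold, hoist, or eliminate
// the loop as dead code.
kernel void int32_mac_chain(device uint*       out  [[buffer(0)]],
                            device const uint* seed [[buffer(1)]],
                            uint tid [[thread_position_in_grid]])
{
    uint a   = seed[tid];
    uint b   = a ^ 0x9E3779B9u;    // data-dependent second operand
    uint acc = tid;
    for (uint i = 0; i < 4096u; ++i) {
        acc = acc * a + b;         // dependent integer MAC chain
        a  += acc;                 // keep the operands varying so nothing
        b  ^= acc;                 // collapses to a loop invariant
    }
    out[tid] = acc;                // the store keeps all the work "live"
}
```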

The absence of any proper documentation from Apple about these sorts of aspects of their chip architecture and performance is one of the things that I find slightly disappointing, particularly as it would be so easy for them to produce it.

I agree. Nvidia goes to great lengths to provide detailed documentation on GPU behavior and optimization, and they manage to do so without divulging trade secrets. I'd say that the lack of documentation - not to mention the current defects - doesn't do much to help adoption on Apple platforms.
 
For arithmetic microbenchmarks, not hitting memory is fairly easy; the bigger problem is fighting the compiler and undefined behavior. Since we don't have access to a lower-level programming interface (even the LLVM-based AIR code goes through an optimizer pass before being converted to the final binary), one sometimes has to jump through weird hoops to avoid the code being hoisted or eliminated outright. Luckily, these moments are easy to spot.



I agree. Nvidia goes to great lengths to provide detailed documentation on GPU behavior and optimization, and they manage to do so without divulging trade secrets. I'd say that the lack of documentation - not to mention the current defects - doesn't do much to help adoption on Apple platforms.
Yup, I've often said that, while OpenCL suffered from a lot of problems, one of the main reasons CUDA "won" was its relative user-friendliness, and that Nvidia made sure to provide not only great documentation but also a huge number of teaching resources to get it adopted. It certainly helped me!
 