OK guys, I think I know what happened with the new GPUs. (Eagerly awaiting a tech talk for these GPUs, similar to the one they made for the A17/M3.) It's NOT the addition of a "tensor core", and it's not the addition of quantization support (or, for that matter, sparsity support). It's the addition of:
1. a third FP pipeline (so we now have an FP16 pipeline, an FP32 pipeline, and a mixed pipeline, which multiplies FP16 with FP16 but accumulates to FP32)
2. modification of the FP32 pipeline so that it can take in FP16 inputs and perform FP16 arithmetic
3. the addition I described earlier of a local register attached to each pipeline, cutting down the back-and-forth traffic between the register cache and the pipelines when performing dot products (e.g. as a component of matrix multiply).
Put these together and the net result is that
- a lot of generic and graphics code will see a nice boost because FP16 ops can execute in two pipelines rather than one, AND
- matrix multiply code can take advantage of all the above features to schedule FP16 matrix multiplies across three pipelines rather than one, giving us the 3x factor that Apple talked about in the A19 presentation (see the sketch below).
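To make the mixed-pipeline idea concrete, here is a minimal scalar sketch of an FP16 dot product that accumulates in FP32, i.e. the FP32 = FP32 + FP16*FP16 operation described above. This is my illustration, not Apple's ISA or shader code; the function name is mine, and it assumes a compiler with the _Float16 extension (e.g. Clang on Apple silicon).

```cpp
// Minimal sketch (illustrative only, not Apple's ISA): a dot product whose
// multiplies take FP16 inputs but whose accumulation is FP32, the operation
// the new "mixed" pipeline appears to target. Assumes Clang's _Float16 extension.
#include <cstddef>
#include <cstdio>

float dot_fp16_accum_fp32(const _Float16* a, const _Float16* b, size_t n) {
    float acc = 0.0f;                                // FP32 accumulator
    for (size_t i = 0; i < n; ++i)
        acc += (float)a[i] * (float)b[i];            // FP16 inputs, product kept in FP32
    return acc;
}

int main() {
    _Float16 a[4], b[4];
    for (int i = 0; i < 4; ++i) { a[i] = (_Float16)(i + 1); b[i] = (_Float16)0.5f; }
    printf("%f\n", dot_fp16_accum_fp32(a, b, 4));    // prints 5.000000
}
```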
My current guess is that this is the technical meaning of the GPU "neural accelerator". Not a tensor core, and not support for any sort of quantization. Maybe those will come, I have no idea. Or maybe Apple felt it was most important to prioritize flexibility over specific techniques (like quantization or structured sparsity) that seem very important right now but which may be replaced by something else in two or three years?
The local register patent was https://patents.google.com/patent/US20250103292A1
My writeup for the new patent follows:
Suppose you want to increase your GPU compute, but without spending much area, so you can't increase the number of cores. What are your options? The most obvious choice, as we have already discussed, is to go superscalar to 2-wide, so that more of your execution units can be active every cycle. Apple's first pass at this, presumably just getting the infrastructure in place, gave a disappointing performance boost because, while three execution pipes are available (FP32, FP16, and INT), INT is rarely used, so its dual issue doesn't help much, and it's rare to have either a single warp that uses both FP16 and FP32, or two independent warps that use these two different types simultaneously.
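A toy illustration of that limitation (my own, in plain C++ with the _Float16 extension rather than real shader code): in the first loop every operation needs the FP16 pipe, so 2-wide issue over the FP32/FP16/INT pipes gains nothing, while the second loop interleaves independent FP32 and FP16 work and could, in principle, issue one op to each pipe per cycle. Real shader code rarely looks like the second case.

```cpp
// Toy illustration only; issue scheduling is done by the GPU hardware/compiler,
// not by anything visible at this level. Assumes Clang's _Float16 extension.

// (A) A pure FP16 stream: every op wants the FP16 pipe, so dual issue over
//     {FP32, FP16, INT} pipes does not help.
void all_fp16(_Float16* x, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = x[i] * (_Float16)1.5f + (_Float16)0.25f;
}

// (B) Independent FP32 and FP16 streams: an FP32 op and an FP16 op could in
//     principle issue in the same cycle -- but this pattern is rare in practice.
void mixed_fp32_fp16(float* y, _Float16* h, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = y[i] * 1.5f + 0.25f;       // FP32 pipe
        h[i] = h[i] * (_Float16)1.5f;     // FP16 pipe, independent of y[i]
    }
}
```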
Other GPUs have either modified their pipelines so that the same pipeline can handle both FP16 and FP32, or, even more ambitiously, so that the same pipeline can handle FP32 and dual-FP16 operation. Both of these make sense. An FP32 unit has to have a multiplier and an adder/shifter, and running those slightly less wide gives you essentially FP16; you need to add a little more logic to get dual FP16 (i.e. an FP16 pair), but you can transport the data into/out of the execution unit along the existing 32-bit-wide path to the operand cache, so you're making better use of the operand cache, which is one of your tightest resources.
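As a sketch of the "FP16 pair" point: two halves occupy a single 32-bit slot, so one 32-bit operand-cache entry and one trip over the 32-bit datapath carry two FP16 values. The type and helper names below are mine, purely for illustration, and again assume Clang's _Float16 extension.

```cpp
// Illustration only: two FP16 values packed into one 32-bit lane, the way a
// dual-FP16 (FP16-pair) datapath would see them.
#include <cstdint>
#include <cstring>

struct Half2 { _Float16 lo, hi; };                 // 2 x 16 bits = one 32-bit slot
static_assert(sizeof(Half2) == sizeof(uint32_t), "pair fits a 32-bit lane");

uint32_t pack(Half2 v)      { uint32_t b; std::memcpy(&b, &v, 4); return b; }  // what moves
Half2    unpack(uint32_t b) { Half2 v;    std::memcpy(&v, &b, 4); return v; }  // over the path

// "Dual FP16": one operation produces two independent FP16 results.
Half2 mul_half2(Half2 a, Half2 b) {
    return { a.lo * b.lo, a.hi * b.hi };
}
```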
Apple has so far avoided this path, saying in patents that to save power they wanted the FP16 and FP32 pipes to be maximally energy efficient with no extra overhead.
This changes with the (2022) patent https://patents.google.com/patent/US12405803B1, "Superscalar execution using pipelines that support different precisions", which appears to describe one of the features we know is present in the A19, and presumably coming in the M5.
So the above are, in a sense, the obvious elements of the patent. There's one less obvious element.
The patent speaks of each SIMD possessing not two but three pipelines, for FP16, FP32, and "mixed precision" meaning FP16 and FP32. Essentially we seem to have pipelines specialized for
- FP16 only,
- FP32 only, and
- FP32 = FP16 + FP16*FP16 and FP32 = FP32 + FP16*FP16 (so FP16 multiplies, but accumulating to FP32; see the sketch after this list)
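Spelled out in scalar form (the helper names are mine, hypothetical, not anything from the patent), those two forms of the mixed pipeline look like this:

```cpp
// Hypothetical scalar helpers, just to spell out the two mixed-precision forms.
// Assumes Clang's _Float16 extension.

// FP32 = FP16 + FP16*FP16: all inputs are half precision, the result is FP32.
float fma_h16_acc16(_Float16 c, _Float16 a, _Float16 b) {
    return (float)c + (float)a * (float)b;
}

// FP32 = FP32 + FP16*FP16: the running accumulator is already FP32.
float fma_h16_acc32(float c, _Float16 a, _Float16 b) {
    return c + (float)a * (float)b;
}
```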
So it seems that in fact
- the amount of logic has increased (an additional, mixed, pipeline has been added)
- under different circumstances (wider instruction issue, or specialized instructions) not two but three FP16 operations might be possible per cycle.
The A19 presentation talked about "Neural Accelerators" added to the GPU, and 3x the "OPs" (whatever that might refer to) possible per cycle. One possible interpretation is that for LLM purposes, executing mostly matrix multiplies in FP16, we go from having one FP16 pipeline available to three; and while generic FP16 code cannot access all three simultaneously, the specific matrix multiply instructions (making use of the local per-FMA register we discussed in the previous patent) can use all three pipelines simultaneously?
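Here is a small sketch of what that interpretation amounts to (entirely my speculation, not Apple's instruction set): a dot product split into three interleaved partial sums, one per pipeline, with each pipeline holding its running FP32 accumulator in a local register rather than bouncing it through the register/operand cache on every FMA, and the three partials combined once at the end.

```cpp
// Speculative sketch of the "3x" reading: round-robin the FMAs of one dot product
// over three pipelines, each keeping a local FP32 accumulator, and merge at the end.
// Assumes Clang's _Float16 extension; purely illustrative, not Apple's ISA.
#include <cstddef>

float dot_three_pipes(const _Float16* a, const _Float16* b, size_t n) {
    float acc[3] = {0.0f, 0.0f, 0.0f};           // one "local" accumulator per pipeline
    for (size_t i = 0; i < n; ++i)
        acc[i % 3] += (float)a[i] * (float)b[i]; // FMA issued to pipeline i % 3
    return acc[0] + acc[1] + acc[2];             // single combine at the end
}
```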