Aren't the tensor cores systolic arrays?
Also, any idea or insight on whether Apple will add FP8 hardware?
A systolic array is a particular type of technology, a particular way to lay out functionality. It is not a synonym for "matrix multiplication"; there are MANY ways to perform high-speed matrix multiplication.
Apple uses very different techniques for the SME coprocessor, the GPU, NEON, and ANE.
nV has probably changed the details of how it does tensor core compute over the years.
Honestly this is not an especially interesting question. Far more important in terms of progress is offloading ever more of the dumb work surrounding matrix multiplication to autonomous hardware. For example, the BIG recent improvements in the tensor core have come from providing a local memory that feeds the tensor core, plus autonomous hardware (TMA, the Tensor Memory Accelerator) that moves data into and out of that memory, rather than relying on the rest of the GPU core to do that work. Apple has essentially the same sort of design in the ANE (and did it first), but on the GPU side they are where nV was five years ago, still requiring data for the Neural Accelerator part of the GPU to pass through standard GPU registers and load/store.
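The pattern that autonomous hardware implements is essentially double buffering: a copy engine fills one tile buffer while the compute unit works on the other, so the multipliers never stall waiting on loads. A minimal Python sketch of the pattern (all names here are illustrative, not any real API):

```python
# Sketch of double-buffered tile feeding: while tile i is being
# multiplied, tile i+1 is already being staged into "local memory"
# by a separate engine. In real hardware the prefetch and the
# compute run concurrently; here the structure is what matters.

def matmul_tiles(a_tiles, b_tiles):
    """Multiply-accumulate a stream of (A, B) tile pairs.

    Each tile is a small dense matrix (list of lists). The two-slot
    'local' buffer stands in for the memory the copy engine fills.
    """
    local = [None, None]                 # two slots: fill one, compute on the other
    local[0] = (a_tiles[0], b_tiles[0])  # prologue: prefetch the first pair

    n = len(a_tiles)
    rows, inner = len(a_tiles[0]), len(b_tiles[0])
    cols = len(b_tiles[0][0])
    acc = [[0.0] * cols for _ in range(rows)]

    for i in range(n):
        if i + 1 < n:                    # "copy engine": prefetch next pair
            local[(i + 1) % 2] = (a_tiles[i + 1], b_tiles[i + 1])
        a, b = local[i % 2]              # "tensor core": consume current pair
        for r in range(rows):
            for k in range(inner):
                for c in range(cols):
                    acc[r][c] += a[r][k] * b[k][c]
    return acc
```

The point of making the copy engine autonomous is that the inner compute loop never issues its own loads from far memory; that is exactly the work being offloaded.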
FP8 is also not especially interesting.
nV evolved their hardware along a path where each step made sense - start the tensor core with FP32/FP16, then add FP8, then a year or two later add FP4. Each step was the easiest way to modify what existed in a minimal fashion while increasing compute. But easiest does not mean BEST...
Is FP8 the best way to balance dense compute, precision, and dynamic range? Especially given that there are at least two different FP8 variants (E4M3 and E5M2), each balancing these issues differently? Compared to linear INT8, or even better INT8 going through a lookup table (ie 8b quantized)? Almost certainly not. But adding a lookup table path into the existing tensor core HW is tough; modifying the FP16 multipliers and adders to give you two FP8s is easier.
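To see why the lookup table matters: linear INT8 spaces its 256 levels uniformly, while a codebook lets the 256 levels sit wherever the weight distribution needs them. A quick illustrative sketch (the cube-law table below is a toy, not any shipping format):

```python
# Linear INT8: 256 uniformly spaced levels over [-w_max, +w_max].
def linear_int8_quantize(w, w_max):
    scale = w_max / 127.0
    q = max(-127, min(127, round(w / scale)))
    return q, q * scale               # stored int8, dequantized value

# Codebook ("8b quantized"): the stored index routes through a
# 256-entry lookup table, so representable values can match the
# weight distribution (dense near zero, sparse in the tails).
def codebook_quantize(w, table):
    idx = min(range(len(table)), key=lambda i: abs(table[i] - w))
    return idx, table[idx]            # stored index, dequantized value

# Toy non-uniform table: cubing a uniform grid packs levels near 0,
# roughly where neural-net weights actually live.
table = [((i - 127.5) / 127.5) ** 3 for i in range(256)]

_, lin = linear_int8_quantize(0.001, w_max=1.0)   # rounds to exactly 0
_, lut = codebook_quantize(0.001, table)          # resolves the small weight
```

For the same 8 bits per weight, the table-based path represents the small weight with far less error than the uniform grid, which simply rounds it to zero.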
With FP4 it's even more stark. No one pretends FP4 is a sensible format; a 16-element lookup table is better in every possible way -- EXCEPT difficulty of adding it to the existing tensor core.
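To make that concrete: FP4 (E2M1, the standard OCP encoding) can represent only 15 distinct values, fixed by the floating-point format, while a 16-entry lookup table costs the same 4 bits per weight but can place its levels anywhere. Enumerating the FP4 grid:

```python
# Enumerate every value representable in FP4 E2M1: 1 sign bit,
# 2 exponent bits, 1 mantissa bit, exponent bias 1, with subnormals.
def fp4_e2m1_values():
    vals = set()
    for sign in (1, -1):
        for exp in range(4):
            for man in range(2):
                if exp == 0:                          # subnormal: 0.m * 2^0
                    v = man * 0.5
                else:                                 # normal: 1.m * 2^(exp-1)
                    v = (1 + man * 0.5) * 2 ** (exp - 1)
                vals.add(sign * v)
    return sorted(vals)

grid = fp4_e2m1_values()
# 15 hardware-dictated levels: +/-{0, 0.5, 1, 1.5, 2, 3, 4, 6}.
# A 16-entry table, by contrast, could hold these exact values OR
# any 16 levels fitted to the actual weight distribution.
```

Anything the FP4 grid can express, a 16-entry table can also express, and the table can do infinitely many things the grid cannot - which is the sense in which it is "better in every possible way".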
nV are a very smart company; I have tremendous respect for them. But, like everyone, they are subject to the constraints of time and history. In a perfect world they'd probably love to redo the tensor core, ripping out FP4 and FP8 and replacing them with a lookup table design. But they can't do that; they have to continue to support what was added in the past, at least for a while. Just because nV do things a certain way, which means much of the ecosystem does them a certain way, doesn't mean that way is optimal. Something like their structured sparsity seems like a bust: it hangs around because it makes all their numbers look twice as large, but it's rare to find a paper actually using it (at least that's been my experience).
Apple on ANE has gone a different path. This might be because they had an early vision of how this stuff would play out (tighter interaction between their SW and HW design, rather than nV trying to be everything to everyone), or perhaps they could more easily drop stuff that wasn't useful, or maybe just dumb luck?
But their model is very much more based on integer quantizations (8, 6, 4, even 3 or 2b) that route through a lookup table to either an FP16 or INT8 value which then goes through the multiplier-adders. This gets them less dense compute than nV, but the same memory bandwidth advantages, and memory bandwidth is the bigger constraint.
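The bandwidth arithmetic behind that tradeoff is easy to sketch (illustrative numbers for palettized / lookup-table quantization in general, not Apple's exact format):

```python
# Storage for N weights held as 4-bit codebook indices, versus raw FP16.
def palettized_bytes(n_weights, bits_per_index=4, table_entries=16,
                     table_entry_bytes=2):    # FP16 table entries
    index_bytes = n_weights * bits_per_index / 8
    table_bytes = table_entries * table_entry_bytes   # tiny, amortized
    return index_bytes + table_bytes

n = 1_000_000
fp16_bytes = n * 2                  # 2 MB of raw FP16 weights
pal_bytes = palettized_bytes(n)     # ~0.5 MB of indices + a 32-byte table
```

Roughly 4x less data moves from memory per weight; on the compute side each 4-bit index is expanded through the table back to FP16 (or INT8) before hitting the multiply-adders, so the multiplier array itself needs no exotic formats. When memory bandwidth, not multiplier count, is the binding constraint, that is where the win comes from.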
In place of structured sparsity, they have a scheme that allows for very low-density (say 0.1% to 10%) outliers, which can be represented by a bitmap and added into a layer as required. Since this is not an nV capability you won't see it discussed in papers, but it appears to work really well when models are modified to take advantage of it.
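The outlier idea can be sketched like this (a hypothetical layout, since the actual format isn't public): weights are stored densely at low precision, plus a bitmap marking the rare positions whose high-precision corrections are kept in a separate short list and added back in.

```python
# Dense low-precision weights + a bitmap of rare outliers kept at
# higher precision. Reconstruction scatters the corrections back in.
def reconstruct(dense_q, outlier_bitmap, outlier_values):
    """dense_q: dequantized dense weights; outlier_bitmap: one bit per
    weight; outlier_values: corrections, in order of the set bits."""
    out = list(dense_q)
    it = iter(outlier_values)
    for i, flagged in enumerate(outlier_bitmap):
        if flagged:
            out[i] += next(it)       # add the high-precision correction
    return out

# 8 weights with 2 outliers (dense here for the toy example;
# the interesting regime is the ~0.1%-10% densities mentioned above).
dense = [0.1, 0.0, -0.2, 0.1, 0.0, 0.3, -0.1, 0.0]
bitmap = [0, 0, 1, 0, 0, 0, 0, 1]
corrections = [-5.0, 7.5]
w = reconstruct(dense, bitmap, corrections)
```

The storage cost is one bit per weight for the bitmap plus a short correction list, rather than paying higher precision everywhere just to capture a handful of large-magnitude weights.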
So bottom line is Apple tends to get a lot more out of their compute than nV, even though it superficially looks weaker.
They seem to like this scheme enough that they've brought elements of it to the GPU.
The part that's clear is that the GPU "Neural Accelerator" is accelerated INT8*INT8 multiplication, not FP8 support. If I'm correct, the logical things to add over the next few years are not more formats (like FP8) but better memory support - eg an equivalent of TMA to feed the Neural Accelerator rather than going through GPU registers - plus table lookup support, and maybe even something like the sparse bitmap support to allow for separate tracking of outliers.
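For reference, "accelerated INT8*INT8" means multiplying 8-bit integer operands and accumulating into a wider integer, the same pattern as nV's INT8 tensor core paths. A minimal scalar model (not Apple's actual implementation, just the arithmetic contract):

```python
# Scalar model of an INT8*INT8 dot product: 8-bit operands, with the
# products summed into a wide accumulator so nothing overflows.
def int8_dot(a, b):
    assert all(-128 <= x <= 127 for x in a + b), "operands must fit in int8"
    acc = 0                  # stands in for an int32 accumulator
    for x, y in zip(a, b):
        acc += x * y         # each product fits in 16 bits; the sum in 32
    return acc
```

Note that nothing in this contract cares where the int8 operands came from - which is exactly why a table-lookup front end (expanding 2-4b indices into int8 values) would bolt on cleanly in front of it.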