I’m fascinated by the use of NPU TOPS as if the GPUs on these Macs don’t blow the NPUs out of the water. Unless you think an iPhone 16 (35 TOPS) and an M4 Max (38 TOPS) have broadly the same ML performance…
In theory, this is why tech bloggers have a job: testing real-world performance. Which tasks are handled by the NPU and which by the GPU? When does RAM bandwidth become the limiting factor in either case? These are genuine questions and I would love it if someone could answer them.
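On the bandwidth question, one rough back-of-the-envelope (my numbers, not anything measured): LLM token generation has to stream essentially all of the weights from RAM for every token, so throughput is capped around bandwidth ÷ model size. A 35 GB model on ~546 GB/s of memory bandwidth (the top M4 Max configuration, if I remember right) tops out near 546 / 35 ≈ 15 tokens/s, regardless of how many TOPS the NPU or GPU can theoretically deliver.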
Also: has Apple actually improved the NPU that much in one generation (the M3 was 18 TOPS), or did they drop to INT8 or even INT4 for the quoted figure? Are the GPU and NPU capable of running at the same precision?
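A sanity check on that, assuming (and this is purely my assumption) Apple quoted the M3's 18 TOPS at FP16 and the M4's 38 TOPS at INT8: at matched precision the M4 would work out to roughly 38 ÷ 2 ≈ 19 FP16 TOPS, barely a generational gain. The FP16 Neural Engine scores below (38197 for the M4 Max vs 30100 for the M3 Ultra) suggest the real improvement is larger than that, for what it's worth.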
Geekbench AI, though I am sure it's not perfect, provides a nice opportunity to measure the Mac's different subsystems with different data types (scores below, higher is better):
System | Subsystem | FP32 | FP16 | INT8 |
---|---|---|---|---|
Apple M4 Max | Core ML GPU | 20383 | 22515 | 21447 |
Apple M4 Max | Core ML CPU | 5514 | 9086 | 7028 |
Apple M4 Max | Core ML Neural Engine | 5515 | 38197 | 52421 |
Apple M3 Ultra | Core ML GPU | 23270 | 25750 | 23591 |
Apple M3 Ultra | Core ML CPU | 5497 | 8413 | 6609 |
Apple M3 Ultra | Core ML Neural Engine | 5489 | 30100 | 33219 |
What's interesting here is that the "best" choice of system depends on the data type and subsystem you plan to use, and conversely, the data type you choose may depend on the system you're on.
For example, for calculations that work just as well in INT8, the M4 Max's NPU is clearly the fastest; an M3 Ultra would be a downgrade. However, for calculations that need FP32, the M3 Ultra's GPU is somewhat faster (23270 vs 20383, about 14%). That alone may not justify its higher cost, though the M3 Ultra's larger memory options will probably come in handy given FP32's larger memory footprint. Note that at that point I am also switching from the NPU to the GPU.
On the other hand, I might prefer FP16 over INT8 on the M3 Ultra if my model converged faster in FP16 (required fewer iterations), but that preference might not hold on the M4 Max, depending on how many more INT8 iterations are needed.
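To put numbers on that break-even point: on the M3 Ultra's NPU, INT8 scores only ~10% higher than FP16 (33219 vs 30100), so FP16 wins if it saves more than ~10% of the iterations. On the M4 Max's NPU the INT8 advantage is ~37% (52421 vs 38197), so FP16 would have to cut iterations by roughly 27% (1 - 38197/52421) to come out ahead.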
Then for tasks that can use INT8 on the NPU, the latest iPhone may be roughly on par with a Mac Studio M4 Max, as others have pointed out, and that is kind of crazy. I suspect that for FP32 calculations that can run on the M3 Ultra's 80-core GPU, the iPhone won't keep up...
Last, though tools like Geekbench AI test their calculations across different data types and let one state a preference for a subsystem, I understand Apple's Core ML framework makes the final decision. From the data above, it appears likely that the FP32 calculations ran on the CPU even when the Neural Engine was preferred: note the near-identical FP32 scores (~5500) for the CPU and Neural Engine rows.
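For anyone curious what that subsystem preference looks like in code, here's a minimal Swift sketch (the model path is a placeholder). The key point is that Core ML's computeUnits setting is a request, not a guarantee, which is consistent with FP32 silently falling back to the CPU above.

```swift
import CoreML

// A minimal sketch of preferencing the Neural Engine in Core ML.
// computeUnits is a request, not a guarantee: Core ML can still
// route individual ops to the CPU (e.g. FP32 work the ANE won't run).
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine  // alternatives: .all, .cpuOnly, .cpuAndGPU

// Placeholder path to a compiled .mlmodelc bundle.
let modelURL = URL(fileURLWithPath: "/path/to/MyModel.mlmodelc")
let model = try MLModel(contentsOf: modelURL, configuration: config)
```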
My takeaway is that both selecting the optimal system and tuning software to run highly complex calculations on these highly complex systems are complicated...