What if the NPU were removed? With the M5's addition of neural acceleration to the GPU cores, we're seeing a bit of redundancy here, aren't we? That's a bit expensive at the chip level.
I feel like it was always a bit of a band-aid solution anyway - a way to look like they had an active hardware roadmap for AI, especially on the A-series chips. But in my (limited) experience, you either used the GPU or the NPU at the app level for acceleration, and as the GPU core counts climb with the Pro, Max and Ultra, the NPU's benefit falls behind to the point of near irrelevance.
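For what it's worth, that app-level choice is usually made indirectly, through Core ML's compute-units hint rather than by touching the hardware directly. A rough sketch (the model path is just a placeholder, and .cpuAndNeuralEngine needs macOS 13 / iOS 16 or later):

```swift
import Foundation
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine   // prefer the NPU (macOS 13+ / iOS 16+)
// config.computeUnits = .cpuAndGPU         // or steer the same model to the GPU
// config.computeUnits = .all               // or let Core ML pick per layer

do {
    // "model.mlmodelc" is a placeholder for a compiled Core ML model bundle.
    let model = try MLModel(contentsOf: URL(fileURLWithPath: "model.mlmodelc"),
                            configuration: config)
    print(model.modelDescription)
} catch {
    print("Could not load model:", error)
}
```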
It's possible, if any of the above holds water, that the M5 is a transition point: the start of a full-fledged tilt at building a scalable, GPU-centric response to PyTorch/CUDA.
The NPU fulfills a different purpose. In fact, Apple Silicon (since M1) contains at least three different ML accelerators, all optimized for different use cases:
- NPU for energy-efficient ML inference accelerating common applications
- AMX/SME for programmable low-latency scientific and ML workloads on the CPU
- GPU for scalable, programmable ML (think research, development, and large models)
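To make the middle bullet concrete: Accelerate's dense linear algebra is how ordinary CPU code reaches that block. A minimal sketch - note that AMX/SME dispatch is an internal detail of Apple's libraries (reportedly what backs BLAS/vDSP on Apple Silicon), not something the API exposes:

```swift
import Accelerate

// Multiply two 2x2 row-major matrices with vDSP.
// C (m x n) = A (m x p) * B (p x n)
let m = 2, n = 2, p = 2
let a: [Float] = [1, 2,
                  3, 4]
let b: [Float] = [5, 6,
                  7, 8]
var c = [Float](repeating: 0, count: m * n)

vDSP_mmul(a, 1, b, 1, &c, 1,
          vDSP_Length(m), vDSP_Length(n), vDSP_Length(p))

print(c)  // [19.0, 22.0, 43.0, 50.0]
```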
The GPU matrix units do not replace the NPU. The latter is designed to support common application use cases, and to do so with very low power consumption. Removing the NPU would actually be detrimental to the user experience.
By the way, this is also the reason why every user-facing platform contains an NPU nowadays. The GPU is a large truck - you use it to haul containers full of stuff. It’s not a practical tool for your daily commute.
... and Apple was widely regarded as being flat-footed when the "AI craze" rolled in. Which only reinforces the point that it was a band-aid solution for something it was never designed for. If its machine learning functions can be readily accommodated by a more robust and capable solution, why hang on to it?
Because they did not have AI models ready, and their hardware was not suitable for scalable ML. But for on-device ML inference Apple has always been the state of the art - and still is.
My guess is that they initially thought scalable ML would remain a specialized niche, so they focused on the application side of things and decided that a generic GPU hardware solution was "good enough" for researchers and developers. They obviously underestimated the interest. Adding matrix acceleration to the GPU merely serves to address this weakness.
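If that is the direction, the developer-facing path on the GPU side is Metal / MPSGraph. A minimal sketch of a plain matmul expressed that way; whether and how the M5's per-core neural accelerators end up handling it is entirely up to Apple's driver and compiler stack - that part is an assumption, not something the API exposes:

```swift
import Metal
import MetalPerformanceShadersGraph

// Build a small FP32 matmul graph and run it on the default GPU.
let n = 512
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let graphDevice = MPSGraphDevice(mtlDevice: device)

let graph = MPSGraph()
let shape: [NSNumber] = [NSNumber(value: n), NSNumber(value: n)]
let a = graph.placeholder(shape: shape, dataType: .float32, name: "A")
let b = graph.placeholder(shape: shape, dataType: .float32, name: "B")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")

// Fill both inputs with ones, so every element of C should equal n.
let ones = [Float](repeating: 1, count: n * n)
let bytes = ones.withUnsafeBufferPointer { Data(buffer: $0) }
let aData = MPSGraphTensorData(device: graphDevice, data: bytes, shape: shape, dataType: .float32)
let bData = MPSGraphTensorData(device: graphDevice, data: bytes, shape: shape, dataType: .float32)

let results = graph.run(with: queue,
                        feeds: [a: aData, b: bData],
                        targetTensors: [c],
                        targetOperations: nil)

var out = [Float](repeating: 0, count: n * n)
results[c]!.mpsndarray().readBytes(&out, strideBytes: nil)
print(out[0])  // expected: 512.0
```

The point being: developers write against the same MPSGraph/Metal surface regardless of which matrix hardware sits underneath, which is the kind of layering a PyTorch/CUDA-style stack would need anyway.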