Properly optimized... that's the ticket! So many people have demanded that Apple add tons of RAM, but in doing so, developers get lazy and don't optimize, and then tons of RAM still isn't enough over time. I've appreciated Apple's conservative approach to RAM over the years, as it has forced optimizations, which is critical to software being great.
The idea that you need the latest machines to run AI is a bluff. They're using this as an excuse to force people to upgrade.
There seems to be a mistaken idea among some posters here that this project shows Apple's AI implementation is not optimized, or that newer hardware is not needed to run large AI models. That is not the case. llama2.c is an unoptimized implementation of the Llama 2 architecture, written in C and targeting CPUs. It is designed to run inference on small, simple Llama models, which is a very different use case from the huge, multi-billion-parameter LLMs we most often talk about today, like ChatGPT.
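For context, essentially all of the compute in llama2.c goes through a plain, scalar matrix-vector multiply. It looks roughly like the sketch below (paraphrased for illustration rather than copied from the repo, so treat the exact names and signature as my own):

    /* Rough shape of the hot loop in llama2.c: a naive matrix-vector
       multiply. W is a (d, n) weight matrix, x is the (n,) input vector,
       and xout receives the (d,) result. One scalar multiply-add at a time. */
    void matmul(float *xout, const float *x, const float *w, int n, int d) {
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[(size_t)i * n + j] * x[j];
            }
            xout[i] = val;
        }
    }

That single loop is where nearly all the time goes, which is why it's the obvious target for SIMD.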
What this developer did was use AltiVec to take this unoptimized CPU implementation and optimize it for the G4's SIMD unit, making it run faster. But it is still constrained to very small models, and if you read the blog post, the AltiVec improvements were fairly modest. There is probably more room for improvement, but SIMD programming is hard; there is a good reason the number of apps that were ever optimized for AltiVec was pretty small.
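To make that concrete: AltiVec lets you process four 32-bit floats per instruction instead of one, so the inner loop above becomes something like the sketch below. This is my own illustrative version under simplifying assumptions (n divisible by 4, 16-byte-aligned data), not the author's actual code:

    #include <altivec.h>

    /* Illustrative AltiVec dot product for one row of the matmul above.
       Assumes n is a multiple of 4 and both pointers are 16-byte aligned;
       a real port also has to handle the ragged/unaligned cases. */
    static float dot_altivec(const float *w_row, const float *x, int n) {
        vector float acc = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
        for (int j = 0; j < n; j += 4) {
            vector float wv = vec_ld(0, w_row + j);  /* load 4 weights          */
            vector float xv = vec_ld(0, x + j);      /* load 4 activations      */
            acc = vec_madd(wv, xv, acc);             /* 4 multiply-adds at once */
        }
        /* reduce the 4 partial sums back to one scalar */
        float parts[4] __attribute__((aligned(16)));
        vec_st(acc, 0, parts);
        return parts[0] + parts[1] + parts[2] + parts[3];
    }

Even in this toy form you can see why AltiVec adoption stayed low: alignment, tail handling, and reductions all have to be managed by hand, and that's before you get to the harder-to-vectorize parts of a transformer.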
When you run LLMs on Apple's latest NPUs or GPUs, this kind of SIMD optimization is already in place. You can think of NPUs and GPUs as enormously parallel machines whose execution operates in a SIMD fashion at the lowest level (Apple's Metal uses a "SIMDGroup" as its unit of execution). The G4's SIMD unit could do up to a billion calculations per second when it was introduced (Apple famously ran its G4 "supercomputer" ad touting that it broke this threshold), while current Apple Silicon NPUs can do 38 trillion operations per second, roughly 38,000 times the raw throughput. I hope that helps explain why, yes, we do actually need modern AI accelerators like NPUs and GPUs to run the largest models.