How is ARM better than x86? It’s been proven many times before that x86 has a performance advantage over ARM. What ARM does in 20 steps, x86 can do in 3.
I designed x86 chips, including the first 64-bit chip (where I designed the 64-bit extensions to the integer math instructions, as well as various parts of the chip itself). I’ve also designed RISC chips: the first was a unique design roughly similar to MIPS, then a PowerPC chip that was the fastest PowerPC in the world and second in speed only to DEC Alpha, then a SPARC chip that was designed to be the fastest SPARC in the world.
So I’ll explain why your statement is wrong.
First, there is zero proof of an x86 performance advantage - in fact, the *first time* that anyone used custom-CPU design techniques (the same techniques that are used at, for example, AMD and Intel) to try to design an Arm chip to compete with x86, it succeeded magnificently (that chip is the M1). Prior to the M1, nobody had actually tried to make an Arm chip to compete with the heart of the x86 market.
Second, there are inherent technical disadvantages to x86. The instruction decoder takes multiple pipeline stages beyond what Arm takes, and that will always be the case. As a result, every time a branch prediction is wrong, or a context switch occurs, there is a much bigger penalty when the pipeline gets flushed. It also means that it is much harder to issue instructions in parallel per core, because variable-length instructions mean you can’t see as much parallelizable code in a given instruction window size. We see this play out in Apple’s ability to issue, what, 6 instructions per cycle per core, vs. the max that x86 could ever possibly do of 3 or 4. And we see in the real world that Arm allows many more in-flight instructions than x86.
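To make the decode problem concrete, here is a toy sketch (the encodings are invented, not any real ISA): with fixed-width instructions, every instruction boundary in a fetch window is known up front, so many decoders can work at once; with variable-length instructions, you can't find instruction N+1 until you've at least partially decoded instruction N.

```python
# Illustrative sketch (hypothetical encodings, not a real ISA): why
# fixed-width decode parallelizes trivially while variable-length
# decode is inherently sequential.

def fixed_width_boundaries(window_len, width=4):
    # Every boundary is known immediately -- all decoders can start at once.
    return list(range(0, window_len, width))

def variable_length_boundaries(window):
    # Hypothetical encoding: the first byte of each instruction gives its
    # length. Each boundary depends on decoding the previous instruction.
    boundaries, pos = [], 0
    while pos < len(window):
        boundaries.append(pos)
        pos += window[pos]  # must decode this instruction to find the next
    return boundaries

# A 16-byte fetch window of made-up variable-length instructions
window = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0, 4, 0, 0, 0, 1])
print(fixed_width_boundaries(16))          # [0, 4, 8, 12] -- known instantly
print(variable_length_boundaries(window))  # [0, 1, 4, 6, 11, 15] -- one at a time
```

Real x86 decoders use tricks (length predecode bits in the instruction cache, speculative decode at every byte offset) to claw some of this back, but those tricks cost transistors and power that a fixed-width decoder simply doesn't need.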
Third, it does no good for x86 to do things in 3 steps vs Arm’s 20 when each of those 3 steps really is 7 internal steps. And that is what happens with x86. An instruction is not an instruction. An x86 instruction will, most of the time, be broken down in the instruction decoder into multiple sequential micro-ops, using a lookup table called a microcode ROM. So those 3 instructions end up being 20 microcode instructions. Each is then processed the same way that an Arm chip would process its own native instructions. The only difference is that Arm chips don’t need to spend the time, electricity and effort to do that conversion, and because the incoming instructions are pre-simplified, it makes it much easier for Arm chips to see far into the future and to parallelize whatever can be parallelized as the instructions come in.
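The microcode-ROM expansion described above can be sketched like this (the instruction names and micro-op sequences here are invented for illustration; real microcode is proprietary and far more involved):

```python
# Hypothetical illustration of microcode expansion: one "complex"
# architectural instruction is looked up in a ROM-style table and
# broken into several simple internal steps (micro-ops).

MICROCODE_ROM = {
    # memory-to-register add: load the operand, then add
    "ADD r1, [mem]": ["load tmp, [mem]", "add r1, r1, tmp"],
    # push: adjust the stack pointer, then store
    "PUSH r1":       ["sub sp, sp, 8", "store [sp], r1"],
    # string move: load, store, bump both pointers
    "MOVS":          ["load tmp, [rsi]", "store [rdi], tmp",
                      "add rsi, rsi, 1", "add rdi, rdi, 1"],
}

def decode(program):
    """Expand each architectural instruction into its micro-ops."""
    micro_ops = []
    for insn in program:
        # Instructions not in the ROM are already simple and pass through.
        micro_ops.extend(MICROCODE_ROM.get(insn, [insn]))
    return micro_ops

prog = ["ADD r1, [mem]", "PUSH r1", "MOVS"]
uops = decode(prog)
print(len(prog), "instructions ->", len(uops), "micro-ops")  # 3 -> 8
```

Those 8 micro-ops are what the execution core actually schedules - which is why counting architectural "steps" tells you almost nothing about performance.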
The reason “complex” instructions like x86’s were once an advantage had nothing whatsoever to do with performance - it was done that way because RAM used to be very expensive, so encoding common sequences of operations into complex instructions saved on instruction memory. At one point it might also have provided a slight speed boost, because there were no instruction caches back then, so fetching less from memory could give a slight improvement in certain (rare) situations - it’s a latency issue only on unpredicted branches.
And, of course, the extra hardware to deal with x86 instruction decoding is not free - on the cores I designed, the instruction decoder was 20% of the core (not including caches). That’s big. On the RISC hardware I designed, the decoders were much tinier than that. In addition to the space, that means power is used, and circuits that want to be next to each other have to be farther apart to make room for it.
Another inherent advantage of Arm over x86 is the number of registers. x86 has a very small number of registers - I can’t, as I sit here, think of a modern architecture with fewer. For the same reasons as you want to avoid instruction memory accesses, you also want to avoid data memory accesses (only more so!). And the fewer registers you have, the more often you will have to perform memory accesses. That’s simply unavoidable. Each memory access takes hundreds of times longer (or thousands or more if you can’t find what you need in the cache) than reading or writing to a register. And because x86 has so few registers, you will have no choice but to do that a lot more than on Arm.
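A back-of-the-envelope way to see the register-count effect (the live-value count is made up, but the register counts are roughly real: x86-64 has 16 general-purpose registers, AArch64 has 31): every value the compiler wants live beyond what the registers can hold costs a store to spill it plus a load to get it back.

```python
# Toy sketch of register pressure: with K registers and more than K
# simultaneously-live values, each extra value costs a spill (store)
# plus a reload (load). The live-value count is illustrative, not
# taken from any real compiler.

def spill_accesses(live_values, num_registers):
    """Memory accesses needed to keep `live_values` values live at once."""
    spilled = max(0, live_values - num_registers)
    return 2 * spilled  # one store to spill + one load to bring it back

live = 24  # hypothetical: values live across a hot loop
print("16 regs (x86-64-ish): ", spill_accesses(live, 16), "memory accesses")
print("31 regs (AArch64-ish):", spill_accesses(live, 31), "memory accesses")
```

In this made-up (but plausible) scenario the x86-style core does 16 extra memory accesses per loop iteration and the Arm-style core does none - and each of those accesses is orders of magnitude slower than touching a register.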
Why does x86 have so few registers? Again, for historical reasons that no longer apply. Whenever you have a context switch (e.g. you switch from one process to another) you need to flush the registers to memory. The more you have, the longer that takes. But now we have memory architectures and buses that allow writing a lot of data in parallel, so you can have more registers and not pay that penalty at all. So why didn’t x86-64 just add a ton more registers? (It added some.) Because the weird pseudo-accumulator style of x86 instruction encoding couldn’t easily benefit from them while still allowing compilers to be ported to it easily.
So the tl;dr version:
1) the only “proof” out there is that Arm has a technological advantage over x86
2) RISC is better than CISC, given modern technological improvements and constraints
3) there are specific engineering reasons why this is so