What are these inherent advantages? Can I expect the same list as what was provided back in 1995?
The only advantage CISC had in 1995 was Intel’s fabs.
The advantages RISC has are the same advantages it always had. x86 requires much more complex instruction decoders with microcode sequencers and microcode ROMs, much more complicated pipelines, much more complicated load/store logic, and much more complicated branch prediction to compensate for those pipelines. All because x86 allows writeable instruction streams, variable-length instructions, ALU instructions that directly address memory, etc.
For all these reasons RISC cores are much smaller than x86 cores. This allows greater clock frequency if you want it (because electrical signals travel at about 6 ps per mm in a chip), less power usage (fewer switching transistors means less charging of capacitance - power = capacitance * V squared * frequency), fewer gates between flip-flops (which also allows higher clock frequency if you want it), and fewer pipe stages (which means less of a penalty when you guess wrong on a conditional branch). ARM also has more general-purpose registers, which means fewer loads and stores, with their inherent multi-cycle penalties.
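To put rough numbers on the frequency and power points (the 3 GHz clock here is just an illustrative choice, not a claim about any particular chip):

```latex
t_{clk} = \frac{1}{3\,\mathrm{GHz}} \approx 333\,\mathrm{ps},
\qquad
\frac{333\,\mathrm{ps}}{6\,\mathrm{ps/mm}} \approx 55\,\mathrm{mm},
\qquad
P_{dyn} = C \, V^2 f
```

So at 3 GHz a signal can cover at most about 55 mm of wire per cycle - far less in practice once gates sit in the path - which is why a physically smaller core is easier to clock fast, and why fewer switching transistors (smaller C) directly means less dynamic power.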
And that’s just the start of the technical advantages.
My understanding of CISC vs RISC is that RISC replaced complex memory-to-memory instructions with simpler (equal-length) instructions that operate on registers in a larger register file, combined with explicit load/store instructions.
This makes for a simpler architecture, i.e. a smaller and simpler core that consumes less power. Correct or?
Largely yes.
In CISC you may say:
ADD [memory A], [memory B] -> [memory C]
If taken literally, you would have to fetch two arguments from memory, perhaps sequentially, each taking multiple cycles even if the data is in the cache - and far longer if it isn’t. Only then can you add. Then you have to write the result into memory. That takes forever.
In RISC you would:
LOAD [memory A], R1
LOAD [memory B], R2
ADD R1, R2 -> R3
STORE R3, [memory C]
Even in this tiny example, the STORE can be postponed and other stuff can execute, as long as R3 isn’t needed for anything. Only when R3 is needed do you have to do the store - at that point perhaps you store multiple things at once, to minimize the penalty.
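Concretely, the core can let independent work flow past the pending store, something like this (memory D and the extra registers are invented for illustration):

```
LOAD [memory A], R1
LOAD [memory B], R2
ADD R1, R2 -> R3         ; result waits in R3 / a store buffer
LOAD [memory D], R4      ; unrelated work keeps executing meanwhile
ADD R4, R4 -> R5
STORE R3, [memory C]     ; drained later, maybe coalesced with other pending stores
```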
And this also simplifies the hardware design tremendously. For x86 you may try to fix this by adding secret registers to hold the results of memory A and memory B and memory C. But then you need to keep track of what those registers hold (with tags), and have complex logic to decide when to deal with them vs when to just do a load/store.
You also, in any modern x86, essentially try to convert the first example into the second example. But the hardware to do that takes space and multiple pipeline stages. It’s simple in this example, but much harder with some of the goofier x86 addressing modes - for example, x86 lets you do:
ADD [memory A offset by the contents of register A], [memory B] -> [memory C]
Now you have to add just to figure out what memory address you are adding!
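Inside a modern x86, that one instruction effectively gets cracked into a RISC-like sequence, roughly like this (T0 is a made-up temporary register for illustration):

```
ADD addressA, RA -> T0   ; the extra add: compute the effective address first
LOAD [T0], R1            ; only now can the first operand be fetched
LOAD [memory B], R2
ADD R1, R2 -> R3
STORE R3, [memory C]
```

The address-generation add has to finish before the first load can even be issued, which is exactly the kind of serialization the simpler RISC encodings avoid.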