Or feel free to elaborate on anything from that post? If it's over my head, I mean... I get that, but don't be afraid to dumb it down for me.
Sorry if my post came across as a bit crass. I guess my reply was addressed not just to you but also to others in this thread, and so things got mixed up somehow. I'll try to explain in more detail. Disclaimer: I am by no means a CPU design expert, and it has been over a decade since I did any real low-level programming, but I did spend the majority of my teenage years with assembler and machine code, and I still like to stay informed about how stuff works. I don't claim to have full internal knowledge of ARM or Intel/AMD CPUs though.
The problem with the terms RISC and CISC is that people throw them around, but they usually fail to distinguish between the messenger and the message. RISC and CISC are first and foremost design principles behind instruction sets, the languages used to control CPU execution. The question, though, is how the CPU implements these instruction sets. And here we come to a very important point which almost inevitably gets lost in discussions like this one: there are no high-performance mainstream CPUs that implement CISC directly, because it's very difficult to build one. Let me try to explain why.
CISC has more complex instructions, which can do more. For example, a common staple of CISC is the ability to fetch data from memory and do an operation on that data in the same instruction. In contrast, RISC designs usually don't have such instructions. There are instructions that load data from memory into the register file, and then there are instructions that operate on registers. So to do a load+op, you'd need two RISC instructions, whereas CISC needs only one.
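To make this concrete, here is a minimal sketch of the same load+add in both styles (x86 in Intel syntax, ARM in AArch64 assembly; the register choices are mine, purely for illustration):

    ; x86 (CISC): a single instruction reads memory and adds
    add eax, [rbx]       ; eax = eax + 32-bit value at address in rbx

    ; AArch64 (RISC): a separate load, then a register-only add
    ldr w1, [x2]         ; w1 = 32-bit value at address in x2
    add w0, w0, w1       ; w0 = w0 + w1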
Early CISC CPU designs implemented the instructions in a straightforward manner. The CPU would check the next instruction and do what it said. If it said "load data from memory", it would do just that; if it said "load data from a register", it would do that instead. The problem is that this model simply doesn't scale. It's extremely inefficient. You need dedicated CPU logic for every instruction in your set, which is a lot of transistors, parts of the CPU are effectively stalled (e.g. your compute unit needs to wait for the memory read), and so on. This wasn't a big issue back in the day, when everything was rather slow, but it quickly became a serious performance problem for Intel. This is why modern Intel CPUs are CISC on the surface only.
The way it actually works is that every CISC instruction is decoded into a series of much simpler micro-operations (essentially an internal, CPU-specific RISC instruction set). A part of the CPU is responsible for fetching x86 instructions and translating them into this "real" machine code, which the CPU can then execute. And these simpler operations get reordered on the fly to enable better resource utilisation. This is instruction-level parallelism (superscalar, out-of-order execution): the CPU can execute parts of different CISC instructions at the same time. For example, going back to the load+operation example: the CPU can start loading the data, and while it travels from memory, it can use the compute unit to perform some other operation. This way there is no waiting, no stalling, and all the computational resources are utilised.
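As a rough illustration (the real micro-op format is undocumented and differs between microarchitectures, so the names below are invented), the decoder might split the x86 load+add from above into something like:

    ; as written in the program:
    add eax, [rbx]

    ; roughly what the decoder emits internally:
    load tmp0, [rbx]     ; start the memory read
    add  eax, eax, tmp0  ; executes only once tmp0 arrives; until
                         ; then the ALU is free to run micro-ops
                         ; belonging to other instructions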
So to sum this up, both current ARM and Intel/AMD CPUs are essentially RISC under the hood. Intel/AMD simply use a more complex, "richer" instruction set, which they decompose into RISC-like operations on the fly (ARM CPUs also use micro-ops, but AFAIK the complexity is on a different level). While you still need two instructions in ARM to do a load+operation, each of those instructions is a simple, uniform unit of work, whereas Intel's single load+operation instruction can take variable time depending on whether you are loading from a register or from memory. So there is not much practical difference here. Furthermore, both ARM and Intel/AMD use a superscalar execution model and have multiple compute/execution units per CPU core, which allows them to execute multiple instructions at the same time on a single thread.
Now to the second part: what do I mean by "legacy" in regard to Intel's CISC instruction set? Well, that instruction set has its origins in the dark ages of computing and contains some very weird instructions. Not to mention that the encoding scheme of the Intel instruction set is a bit of a mess: instructions have different lengths, they can have different types of operands, and sometimes there are weird limitations. With 64-bit ARM (AArch64), every instruction is 32 bits wide, the encoding scheme is transparent, and the instruction decoder on the CPU can be much simpler. Also, the instruction set itself is very symmetrical. As an LLVM developer put it (I saw it in a presentation somewhere, can't find it anymore): 64-bit ARM is about as friendly to a compiler developer as it gets.
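To illustrate the encoding difference, here are a few x86-64 instructions with their machine-code bytes (these follow the standard encodings in the Intel manuals, but double-check me against the docs) next to the AArch64 rule:

    ; x86-64: variable length, anywhere from 1 to 15 bytes
    90                   ; nop                  (1 byte)
    03 03                ; add eax, [rbx]       (2 bytes)
    48 89 D8             ; mov rax, rbx         (3 bytes)
    B8 78 56 34 12       ; mov eax, 0x12345678  (5 bytes)

    ; AArch64: every instruction is exactly 4 bytes, so the decoder
    ; always knows where each of the next N instructions begins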
So no, I don't see how ARM is limited by its instruction set. Sure, it sometimes needs several instructions to express what Intel x86 can do in one, but I don't see why this should be a disadvantage in practice. There are pros and cons to both approaches: ARM sometimes needs more instructions, but its instruction set is lean and easy to decode, and the instruction flow is arguably simpler to analyse. Intel can save on instruction count, but pays with a vastly more complex instruction encoding scheme.
In the end, the most important thing (as I mentioned at the beginning of this ridiculously long post) is to understand the difference between the instruction set and how it's implemented. Modern Intel CPUs might use the same register names as the CPUs of the '90s, but that doesn't mean they literally have those registers; the names get renamed onto a much larger pool of physical registers inside the core. A CPU instruction set is just a programming language, and it gets translated into other internal representations just like any other programming language.
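A sketch of what register renaming does with the name "eax" (the physical register numbers are invented; real cores have a pool of hundreds):

    mov eax, [rsi]       ; renamed: p10 = load [rsi]
    add eax, 1           ; renamed: p11 = p10 + 1
    mov [rdi], eax       ; renamed: store p11
    mov eax, [rsi+8]     ; renamed: p12 = load [rsi+8]
                         ; same architectural name "eax", but a fresh
                         ; physical register, so this load can start
                         ; before the three lines above have finished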
Hope this was at least remotely useful.
OR: does anyone just have a book I could look up that covers the modern differences between the architectures? Or a handful of reference books we could make a sticky about? That would probably prevent a lot of misinformation from floating around. Plus, I don't mind going to a library if it means I will learn something.
The best sources I am aware of are the Intel and ARM technical documentation, as well as the incredible work done by Agner Fog: http://www.agner.org/optimize/ (he has a very detailed description of Intel-compatible CPU microarchitectures).