Well, more transistors don’t do you any good if you don’t put them to good use, right?
If you think about single-threaded operations, there are only so many transistors you can put to good use. Adding more functional units (adders, multipliers, shifters, etc.) has diminishing returns beyond about 4 of each, and past 8 it is really unlikely to be worth it. Adding more cache is nice, but you can only add so much before you get diminishing returns, both because you aren’t using it all and because the more you have, the more distant some of it will be from the core, meaning the slower it will be. The same goes for the branch predictor, and so on.
So, if you look into it, most ALUs (the part of the CPU that does integer math) use roughly the same number of transistors, within an order of magnitude of each other; I’m speaking qualitatively here, not quantitatively.
For parallel operations you can add more cores. That, too, has diminishing returns. 8 is probably more than enough for 95% of uses, and the benefit falls off rapidly each time you double the number of cores.
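One common way to put a number on that falloff is Amdahl’s law: if a fraction p of the work can run in parallel, the best speedup you can get on n cores is 1 / ((1 − p) + p/n). A quick sketch in C (the 90%-parallel figure is just an illustrative assumption, not a measurement of any real workload):

```c
#include <stdio.h>

/* Amdahl's law: the speedup on n cores when a fraction p of the work
 * can run in parallel and the remaining (1 - p) stays serial. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.90;  /* assumed parallel fraction, purely illustrative */
    for (int n = 1; n <= 64; n *= 2)
        printf("%2d cores: %.2fx speedup\n", n, amdahl(p, n));
    return 0;
}
```

With that assumption, 8 cores already get you about 4.7x of the 10x ceiling, and going all the way to 64 cores only gets you to about 8.8x, which is the falloff I’m talking about.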
So what Apple did with the transistors is use a lot of them for other types of computation: graphics, machine learning, and so on. These are all good uses, but they don’t explain why the M1 destroys Intel even for traditional CPU functions that make no use of all this other circuitry.
On the same process, I would expect AMD to lag Apple by around 10–20% in performance per watt, because of the CISC penalty. Of course AMD may run faster in absolute terms, but only because they choose to burn a lot more power to do it; that’s why we think in terms of performance/watt. AMD probably uses a similar design methodology to Apple, so I would think any difference in performance/watt would be a pretty good indication of the benefits of Arm vs. x86-64.
As for AMD, having designed many processors for them, I wouldn’t describe them as hybrid. They are CISC. CISC and RISC mean different things to different people, and can mean different things depending on the context, but there are two things that any CPU designer would likely agree scream “this is CISC!”, and x86 (AMD’s chips included) has both of them:
1) To decode the instructions, you need a state machine. In other words, you have to take multiple clock cycles to decode an instruction, because you have to make multiple passes over the instruction stream to figure out what is going on, convert the instructions into different kinds of instructions, detect dependencies, etc. (see the first sketch after this list).
2) Any arbitrary instruction can access memory. In RISC, only LOAD and STORE instructions (and maybe one or two others depending on the architecture) can access memory. That restriction greatly simplifies what is necessary for dealing with register mapping, inter-instruction dependencies for scheduling, etc. (the second sketch below illustrates the extra cracking a CISC machine has to do).
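To make point 1 concrete, here’s a toy sketch in C. The encoding is made up (a fake ISA where the low bits of the first byte give the instruction’s length, far simpler than real x86), but it shows the serial dependency: you can’t know where instruction i+1 begins until you’ve at least partially decoded instruction i. With fixed 4-byte instructions, as on Arm, every boundary is known up front, so several decoders can chew on the stream at once.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy variable-length ISA (hypothetical, NOT real x86 encoding):
 * the low 4 bits of the first byte give the instruction's total length.
 * Finding the boundaries is inherently serial, because the start of
 * instruction i+1 depends on the decoded length of instruction i. */
static size_t find_boundaries_variable(const uint8_t *bytes, size_t n,
                                       size_t *starts, size_t max)
{
    size_t count = 0, pc = 0;
    while (pc < n && count < max) {
        starts[count++] = pc;
        size_t len = bytes[pc] & 0x0F;   /* must decode this before moving on */
        if (len == 0) break;             /* invalid encoding in this toy ISA */
        pc += len;
    }
    return count;
}

/* Fixed 4-byte instructions (Arm-style): every boundary is known
 * immediately, so the decoders can all start at once. */
static size_t find_boundaries_fixed(size_t n, size_t *starts, size_t max)
{
    size_t count = 0;
    for (size_t pc = 0; pc + 4 <= n && count < max; pc += 4)
        starts[count++] = pc;
    return count;
}

int main(void) {
    const uint8_t stream[] = { 0x03, 0xAA, 0xBB, 0x02, 0xCC,
                               0x04, 0x01, 0x02, 0x03 };
    size_t starts[16];
    size_t k = find_boundaries_variable(stream, sizeof stream, starts, 16);
    for (size_t i = 0; i < k; i++)
        printf("variable-length insn %zu starts at byte %zu\n", i, starts[i]);

    k = find_boundaries_fixed(sizeof stream, starts, 16);
    printf("fixed-length stream of the same size: %zu insns, boundaries known instantly\n", k);
    return 0;
}
```

Real x86 length-finding involves prefixes, opcode maps, and ModRM bytes rather than one nibble, but the dependency has the same shape.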
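And for point 2, a toy model of what that costs (the structs and the reserved temp register here are invented for illustration, not lifted from any real design): a CISC-style “add register, memory” hides a load inside an ALU instruction, so the front end has to crack it into a load micro-op plus an add micro-op, and that hidden load is one more thing the renamer and scheduler have to track. A load/store ISA hands the machine the already-split form.

```c
#include <stdio.h>

/* Toy model, not any real microarchitecture: cracking a CISC-style
 * "ADD dst, [addr]" into the two micro-ops the out-of-order core
 * actually schedules. In a load/store ISA the program already arrives
 * as a separate load and add, so there is nothing to crack. */

typedef enum { UOP_LOAD, UOP_ALU_ADD } uop_kind;

typedef struct {
    uop_kind kind;
    int dst;    /* destination register            */
    int src1;   /* first source (or address) reg   */
    int src2;   /* second source reg, -1 if unused */
} uop;

/* Crack dst = dst + memory[addr_reg] into two dependent micro-ops. */
static int crack_add_reg_mem(int dst, int addr_reg, uop *out)
{
    const int tmp = 63;                                /* reserved temp reg in this toy */
    out[0] = (uop){ UOP_LOAD,    tmp, addr_reg, -1 };  /* tmp <- memory[addr_reg] */
    out[1] = (uop){ UOP_ALU_ADD, dst, dst,      tmp }; /* dst <- dst + tmp        */
    return 2;                                          /* one instruction, two uops */
}

int main(void) {
    uop uops[2];
    int n = crack_add_reg_mem(3, 7, uops);
    printf("one reg-mem add cracked into %d micro-ops\n", n);
    return 0;
}
```

The cracking itself is cheap; the cost is that every path in the front end has to allow for it, because any instruction might turn out to be hiding a memory access.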