Well, more transistors don’t do you any good if you don’t put them to good use, right?
If you think about single-threaded operations, there are only so many transistors you can put to good use. Adding more functional units (adders, multipliers, shifters, etc.) has diminishing returns beyond about 4 of each, and past 8 it is really unlikely to be worth it. Adding more cache is nice, but you can only add so much before you get diminishing returns, both because you aren’t using it all and because the more you have, the more distant some of it will be from the core, meaning the slower it will be. The same goes for the branch predictor, and so on.
So, if you look into it, most ALUs (the part of the CPU that does integer math) use roughly the same number of transistors, within an order of magnitude of each other; I’m speaking qualitatively here, not quantitatively.
For parallel operations you can add more cores. That, too, has diminishing returns. 8 is probably more than enough for 95% of uses, and the benefit falls off rapidly each time you double the number of cores.
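One common way to put a number on that falloff is Amdahl’s law: if a fraction p of the work can run in parallel, the best speedup you can get on n cores is 1 / ((1 − p) + p/n). A quick sketch in C (the 90%-parallel figure is just an illustrative assumption, not a measurement of any real workload):

```c
#include <stdio.h>

/* Amdahl's law: the speedup on n cores when a fraction p of the work
 * can run in parallel and the remaining (1 - p) stays serial. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.90;  /* assumed parallel fraction, purely illustrative */
    for (int n = 1; n <= 64; n *= 2)
        printf("%2d cores: %.2fx speedup\n", n, amdahl(p, n));
    return 0;
}
```

With that assumption, 8 cores already get you about 4.7x of the 10x ceiling, and going all the way to 64 cores only gets you to about 8.8x, which is the falloff I’m talking about.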
So what Apple did with the transistors is use a lot of them for other types of computation: graphics, machine learning, and so on. These are all good uses, but they don’t explain why the M1 destroys Intel even for traditional CPU functions that make no use of all this other circuitry.
On the same process, I would expect AMD to lag Apple by around 10–20% in performance per watt, because of the CISC penalty. Of course AMD may run faster in absolute terms, but only because they choose to burn a lot more power to do it; that’s why we think in terms of performance/watt. AMD probably uses a similar design methodology to Apple, so I would think any difference in performance/watt would be a pretty good indication of the benefits of Arm vs. x86-64.
As for AMD, having designed many processors for them, I wouldn’t describe them as hybrid. They are CISC. CISC and RISC mean different things to different people, and can mean different things depending on the context, but there are two things that any CPU designer would likely agree scream “this is CISC!”, and x86 (AMD’s chips included) has both of them:
1) To decode the instructions, you need a state machine. In other words, you have to take multiple clock cycles to decode an instruction, because you have to make multiple passes over the instruction stream to figure out what is going on, convert the instructions into different kinds of instructions, detect dependencies, etc. (see the first sketch after this list).
2) Any arbitrary instruction can access memory. In RISC, only LOAD and STORE instructions (and maybe one or two others depending on the architecture) can access memory. That restriction greatly simplifies what is necessary for dealing with register mapping, inter-instruction dependencies for scheduling, etc. (the second sketch below illustrates the extra cracking a CISC machine has to do).
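To make point 1 concrete, here’s a toy sketch in C. The encoding is made up (a fake ISA where the low bits of the first byte give the instruction’s length, far simpler than real x86), but it shows the serial dependency: you can’t know where instruction i+1 begins until you’ve at least partially decoded instruction i. With fixed 4-byte instructions, as on Arm, every boundary is known up front, so several decoders can chew on the stream at once.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy variable-length ISA (hypothetical, NOT real x86 encoding):
 * the low 4 bits of the first byte give the instruction's total length.
 * Finding the boundaries is inherently serial, because the start of
 * instruction i+1 depends on the decoded length of instruction i. */
static size_t find_boundaries_variable(const uint8_t *bytes, size_t n,
                                       size_t *starts, size_t max)
{
    size_t count = 0, pc = 0;
    while (pc < n && count < max) {
        starts[count++] = pc;
        size_t len = bytes[pc] & 0x0F;   /* must decode this before moving on */
        if (len == 0) break;             /* invalid encoding in this toy ISA */
        pc += len;
    }
    return count;
}

/* Fixed 4-byte instructions (Arm-style): every boundary is known
 * immediately, so the decoders can all start at once. */
static size_t find_boundaries_fixed(size_t n, size_t *starts, size_t max)
{
    size_t count = 0;
    for (size_t pc = 0; pc + 4 <= n && count < max; pc += 4)
        starts[count++] = pc;
    return count;
}

int main(void) {
    const uint8_t stream[] = { 0x03, 0xAA, 0xBB, 0x02, 0xCC,
                               0x04, 0x01, 0x02, 0x03 };
    size_t starts[16];
    size_t k = find_boundaries_variable(stream, sizeof stream, starts, 16);
    for (size_t i = 0; i < k; i++)
        printf("variable-length insn %zu starts at byte %zu\n", i, starts[i]);

    k = find_boundaries_fixed(sizeof stream, starts, 16);
    printf("fixed-length stream of the same size: %zu insns, boundaries known instantly\n", k);
    return 0;
}
```

Real x86 length-finding involves prefixes, opcode maps, and ModRM bytes rather than one nibble, but the dependency has the same shape.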
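And for point 2, a toy model of what that costs (the structs and the reserved temp register here are invented for illustration, not lifted from any real design): a CISC-style “add register, memory” hides a load inside an ALU instruction, so the front end has to crack it into a load micro-op plus an add micro-op, and that hidden load is one more thing the renamer and scheduler have to track. A load/store ISA hands the machine the already-split form.

```c
#include <stdio.h>

/* Toy model, not any real microarchitecture: cracking a CISC-style
 * "ADD dst, [addr]" into the two micro-ops the out-of-order core
 * actually schedules. In a load/store ISA the program already arrives
 * as a separate load and add, so there is nothing to crack. */

typedef enum { UOP_LOAD, UOP_ALU_ADD } uop_kind;

typedef struct {
    uop_kind kind;
    int dst;    /* destination register            */
    int src1;   /* first source (or address) reg   */
    int src2;   /* second source reg, -1 if unused */
} uop;

/* Crack dst = dst + memory[addr_reg] into two dependent micro-ops. */
static int crack_add_reg_mem(int dst, int addr_reg, uop *out)
{
    const int tmp = 63;                                /* reserved temp reg in this toy */
    out[0] = (uop){ UOP_LOAD,    tmp, addr_reg, -1 };  /* tmp <- memory[addr_reg] */
    out[1] = (uop){ UOP_ALU_ADD, dst, dst,      tmp }; /* dst <- dst + tmp        */
    return 2;                                          /* one instruction, two uops */
}

int main(void) {
    uop uops[2];
    int n = crack_add_reg_mem(3, 7, uops);
    printf("one reg-mem add cracked into %d micro-ops\n", n);
    return 0;
}
```

The cracking itself is cheap; the cost is that every path in the front end has to allow for it, because any instruction might turn out to be hiding a memory access.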