I mentioned it because the two are more alike than not. That's why I've said the M1 being RISC is not as important as people make it out to be. IMO there are other factors, such as unified memory, specialized processors, and the on-board GPU, which are far more important. RISC is a benefit, but it's a slight one compared to the other design decisions that went into the M1.

There are system level benefits to the other factors you mention, but I don’t think they explain this:

[attached benchmark comparison chart]


Much less the fact that M1 does this without breaking a sweat.

At most, the SoC will help reduce latency when there’s a cache miss which is rare for these benchmarks.
 
Ignore the resume, and review the post history. You can tell a lot more about someone from what they say than who they claim to be. For example, having read back through your dozen posts bickering back and forth I notice you haven't yet made a true technical argument.

There are a few people in these forums who speak sense. @cmaier is one of them. You don't have to agree with every word someone says to recognize that they know what they're talking about. Whereas your reliance on snark and ad hominem attacks suggests the opinion you're defending is probably not well supported.



But one can learn much more about commanding a ship from the captain of the Titanic than from some guy on the dock talking smack about him.
What kind of technical argument would you like? Would you like to hear about out of order execution? Register renaming? uOps? Decoders? Execution units? Orthogonal instruction sets?

Fact is even Apple isn't focusing on the RISC nature of the M1 in their description as to why the M1 is as fast as it is. They speak to unified memory, GPU, neural engines, etc.

As I opened my involvement in this thread with: I've heard this for so many years back in the PPC days. No new arguments here.
 
There are system level benefits to the other factors you mention, but I don’t think they explain this:



Much less the fact that M1 does this without breaking a sweat.

At most, the SoC will help reduce latency when there’s a cache miss which is rare for these benchmarks.
Why not? Don't tell us you don't think, tell us the technical reasons which led you to such a conclusion.
 
There are system level benefits to the other factors you mention, but I don’t think they explain this:

[attached benchmark comparison chart]

Much less the fact that M1 does this without breaking a sweat.

At most, the SoC will help reduce latency when there’s a cache miss which is rare for these benchmarks.

It would be interesting to see more information on the specific cores themselves (excluding the SoC impact) and to examine their performance in isolation.
 
What kind of technical argument would you like? Would you like to hear about out of order execution? Register renaming? uOps? Decoders? Execution units? Orthogonal instruction sets?

Fact is even Apple isn't focusing on the RISC nature of the M1 in their description as to why the M1 is as fast as it is. They speak to unified memory, GPU, neural engines, etc.

As I opened my involvement in this thread with: I've heard this for so many years back in the PPC days. No new arguments here.

If there are no real benefits to using RISC, why didn't Apple base their SoC on x86? Wouldn't they get the same benefit if it's the SoC approach that is yielding all the gains? It seems that would have been a simpler transition, one that wouldn't require software to be recompiled/refactored for a new architecture.
 
So if RISC is so superior why did anyone go with CISC in the first place?
Memory constraints. CISC's benefit is reduced code size: it consumes less memory to use a single multiplication instruction than to construct the same multiplication from additions and shifts. It's also likely to be faster, since it's executed in "hardware".

To illustrate, here is a multiplication routine written in 6502 assembly language (the 6502 does not have a multiplication instruction):

Code:
;32 bit multiply with 64 bit product

MULTIPLY:  lda     #$00
           sta     PROD+4   ;Clear upper half of
           sta     PROD+5   ;product
           sta     PROD+6
           sta     PROD+7
           ldx     #$20     ;Set binary count to 32
SHIFT_R:   lsr     MULR+3   ;Shift multiplier right
           ror     MULR+2
           ror     MULR+1
           ror     MULR
           bcc     ROTATE_R ;Go rotate right if c = 0
           lda     PROD+4   ;Get upper half of product
           clc              ; and add multiplicand to
           adc     MULND    ; it
           sta     PROD+4
           lda     PROD+5
           adc     MULND+1
           sta     PROD+5
           lda     PROD+6
           adc     MULND+2
           sta     PROD+6
           lda     PROD+7
           adc     MULND+3
ROTATE_R:  ror     a        ;Rotate partial product
           sta     PROD+7   ; right
           ror     PROD+6
           ror     PROD+5
           ror     PROD+4
           ror     PROD+3
           ror     PROD+2
           ror     PROD+1
           ror     PROD
           dex              ;Decrement bit count and
           bne     SHIFT_R  ; loop until 32 bits are
           clc              ; done
           lda     MULXP1   ;Add dps and put sum in MULXP2
           adc     MULXP2
           sta     MULXP2
           rts

Using CISC, assuming the processor has a multiplication instruction, one could merely write:

Code:
           mul     R1, R2      ; Multiply the values of R1 and R2
Note: The CISC instruction is hypothetical and is intended for illustrative purposes only.

As you can see the CISC code is much, much smaller. As for execution time it depends on how quickly a processor can execute all of the instructions used to build the multiplication routine versus the time taken to execute a single multiplication instruction.

The concept behind RISC is that each instruction executes very quickly, so the total time to execute the whole sequence can be less than that of a single complex multiplication instruction. For CISC the idea is that the built-in MUL opcode will execute faster than a string of smaller opcodes. However, not all processor instructions are actually hardwired; many are built on microcode, which is, for this discussion, a small program within the processor itself.

Once again the 80/20 rule can be invoked. RISC designers found that applications consist primarily (80%) of a small number of frequently used instructions, and they optimized for those. The other 20% could, as the multiplication example above shows, be built from those instructions.
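For what it's worth, the same shift-and-add idea can be written as a minimal C sketch (the function name and types here are mine, purely for illustration). It is essentially what the 6502 routine above does, just a 32-bit word at a time instead of a byte at a time:

Code:
#include <stdint.h>

/* Illustrative only: 32 x 32 -> 64-bit multiply built from nothing but
   shifts and adds, mirroring the 6502 routine above. */
uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;

    for (int bit = 0; bit < 32; bit++) {
        if (multiplier & 1)                              /* low bit set?        */
            product += (uint64_t)multiplicand << bit;    /* add shifted operand */
        multiplier >>= 1;                                /* move to the next bit */
    }
    return product;
}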

HTH
 
  • Like
Reactions: Frantisekj
If there are no real benefits to using RISC, why didn't Apple base their SoC on x86? Wouldn't they get the same benefit if it's the SoC approach that is yielding all the gains? It seems that would have been a simpler transition, one that wouldn't require software to be recompiled/refactored for a new architecture.
Why do you say there's no real benefit to using RISC?
 
If there are no real benefits to using RISC, why didn't Apple base their SoC on x86? Wouldn't they get the same benefit if it's the SoC approach that is yielding all the gains? It seems that would have been a simpler transition, one that wouldn't require software to be recompiled/refactored for a new architecture.

AMD has (I believe) been doing SoC since at least bulldozer. I am quoted about it here:


Yet they can match M1 only by running the clock at 166% of M1’s.
 
Would you like to hear about out of order execution? Register renaming? uOps? Decoders? Execution units? Orthogonal instruction sets?
Yes.

Not just a list of bullet points and names you pull from a block diagram, but a specific logical argument about how each one refutes the point that RISC makes for a lower power, more performant architecture than CISC.
 
Why not? Don't tell us you don't think, tell us the technical reasons which led you to such a conclusion.
Sorry, it's a habit not to assume 100% certainty in anything...

Embedded GPU, specialized processors, and unified memory don't explain the single-core benchmark difference because the benchmark measures the performance of a single core. It isolates the CPU core from the system. The GPU doesn't play a role, and the special coprocessors don't play a role. As I said, the in-package memory will cut the latency in the event of a cache miss, but if you look at the Geekbench documentation, there aren't many cache misses on the Intel reference run.
 
AMD has (I believe) been doing SoC since at least bulldozer. I am quoted about it here:


Yet they can match M1 only by running the clock at 166% of M1’s.
Nothing like comparing a 2011 architecture to one from 2020.
 
Yes.

Not just a list of bullet points and names you pull from a block diagram, but a specific logical argument about how each one refutes the point that RISC makes for a lower power, more performant architecture than CISC.
The problem is: They don't. Most are shared between the two architectures. Orthogonal instructions make the task of decoding instructions easier but instruction decoding is such a small part of a CPU die that it's not much of an issue any more.

There you go. You're welcome.
 
Memory constraints. CISC's benefit is reduced code size: it consumes less memory to use a single multiplication instruction than to construct the same multiplication from additions and shifts. It's also likely to be faster, since it's executed in "hardware".

To illustrate, here is a multiplication routine written in 6502 assembly language (the 6502 does not have a multiplication instruction):

Code:
;32 bit multiply with 64 bit product

MULTIPLY:  lda     #$00
           sta     PROD+4   ;Clear upper half of
           sta     PROD+5   ;product
           sta     PROD+6
           sta     PROD+7
           ldx     #$20     ;Set binary count to 32
SHIFT_R:   lsr     MULR+3   ;Shift multiplier right
           ror     MULR+2
           ror     MULR+1
           ror     MULR
           bcc     ROTATE_R ;Go rotate right if c = 0
           lda     PROD+4   ;Get upper half of product
           clc              ; and add multiplicand to
           adc     MULND    ; it
           sta     PROD+4
           lda     PROD+5
           adc     MULND+1
           sta     PROD+5
           lda     PROD+6
           adc     MULND+2
           sta     PROD+6
           lda     PROD+7
           adc     MULND+3
ROTATE_R:  ror     a        ;Rotate partial product
           sta     PROD+7   ; right
           ror     PROD+6
           ror     PROD+5
           ror     PROD+4
           ror     PROD+3
           ror     PROD+2
           ror     PROD+1
           ror     PROD
           dex              ;Decrement bit count and
           bne     SHIFT_R  ; loop until 32 bits are
           clc              ; done
           lda     MULXP1   ;Add dps and put sum in MULXP2
           adc     MULXP2
           sta     MULXP2
           rts

Using CISC, assuming the processor has a multiplication instruction, one could merely write:

Code:
           mul     R1, R2      ; Multiply the values of R1 and R2
Note: The CISC instruction is hypothetical and is intended for illustrative purposes only.

As you can see the CISC code is much, much smaller. As for execution time it depends on how quickly a processor can execute all of the instructions used to build the multiplication routine versus the time taken to execute a single multiplication instruction.

The concept of RISC is each instruction can be executed very quickly and therefore the sum to execute all of them would be less than that of a single multiplication instruction. For CISC the idea is that the built in MUL op code would execute faster than a bunch of smaller op codes. However not all processor instructions are actually hardwired. They are built upon microcode which is, for this discussion, a small program within the processor itself.

Once again the 80/20 rule can be invoked. RISC designers found that applications primarily (80%) consist of a small number of frequently used instructions and optimized for them. The other 20% could, as the multiplication example above shows, be built from those instructions.

HTH
6502 was CISC, not RISC.

Also, all RISC processors have multiply instructions, and they all take multiple cycles to execute, just like on CISC machines.

The concept of RISC is not what you say it is.
 
  • Like
Reactions: Analog Kid
Memory constraints. CISC's benefit is reduced code size: it consumes less memory to use a single multiplication instruction than to construct the same multiplication from additions and shifts. It's also likely to be faster, since it's executed in "hardware".

To illustrate, here is a multiplication routine written in 6502 assembly language (the 6502 does not have a multiplication instruction):

Code:
;32 bit multiply with 64 bit product

MULTIPLY:  lda     #$00
           sta     PROD+4   ;Clear upper half of
           sta     PROD+5   ;product
           sta     PROD+6
           sta     PROD+7
           ldx     #$20     ;Set binary count to 32
SHIFT_R:   lsr     MULR+3   ;Shift multiplier right
           ror     MULR+2
           ror     MULR+1
           ror     MULR
           bcc     ROTATE_R ;Go rotate right if c = 0
           lda     PROD+4   ;Get upper half of product
           clc              ; and add multiplicand to
           adc     MULND    ; it
           sta     PROD+4
           lda     PROD+5
           adc     MULND+1
           sta     PROD+5
           lda     PROD+6
           adc     MULND+2
           sta     PROD+6
           lda     PROD+7
           adc     MULND+3
ROTATE_R:  ror     a        ;Rotate partial product
           sta     PROD+7   ; right
           ror     PROD+6
           ror     PROD+5
           ror     PROD+4
           ror     PROD+3
           ror     PROD+2
           ror     PROD+1
           ror     PROD
           dex              ;Decrement bit count and
           bne     SHIFT_R  ; loop until 32 bits are
           clc              ; done
           lda     MULXP1   ;Add dps and put sum in MULXP2
           adc     MULXP2
           sta     MULXP2
           rts

Using CISC, assuming the processor has a multiplication instruction, one could merely write:

Code:
           mul     R1, R2      ; Multiply the values of R1 and R2
Note: The CISC instruction is hypothetical and is intended for illustrative purposes only.

As you can see the CISC code is much, much smaller. As for execution time it depends on how quickly a processor can execute all of the instructions used to build the multiplication routine versus the time taken to execute a single multiplication instruction.

The concept of RISC is each instruction can be executed very quickly and therefore the sum to execute all of them would be less than that of a single multiplication instruction. For CISC the idea is that the built in MUL op code would execute faster than a bunch of smaller op codes. However not all processor instructions are actually hardwired. They are built upon microcode which is, for this discussion, a small program within the processor itself.

Once again the 80/20 rule can be invoked. RISC designers found that applications primarily (80%) consist of a small number of frequently used instructions and optimized for them. The other 20% could, as the multiplication example above shows, be built from those instructions.

HTH

Your example is a bit weird. Arm has a MUL instruction.

Yes, CISC was originally an attempt to reduce the memory footprint, but mostly it was the first cut at CPU design at a time when compilers were in their infancy. The assembly code was the language of the CPU, so the instructions were chosen based on what the designer thought the chip would be asked to do. You want to add two values from memory? There's an instruction for that.

RISC identified the weaknesses in the CISC approach in the context of modern development tools and tried to better partition the complexity.

VLIW tried to take that a step further and push the scheduling problem out to the compiler but, so far at least, that's been a failure for everything except a few special-purpose applications (see TI DSPs), mostly because it doesn't handle generational changes in architecture very well.

All of the fanciness around CISC these days is less about the benefits of the CISC architecture in itself and more about the value of keeping backwards compatibility. All that technical debt finally came due.

Other than maybe some legacy 8051 designs, there aren't a lot of CISC architectures around anymore.

So I think the question is less about "if RISC is better, why did they use CISC to begin with" (because CISC was first) and more a question of once RISC came to prominence, how many successful new CISC architectures have we seen?
 
Sorry, it's a habit not to assume 100% certainty in anything...

Embedded GPU, specialized processors, and unified memory don't explain the single-core benchmark difference because the benchmark measures the performance of a single core. It isolates the CPU core from the system. The GPU doesn't play a role, and the special coprocessors don't play a role. As I said, the in-package memory will cut the latency in the event of a cache miss, but if you look at the Geekbench documentation, there aren't many cache misses on the Intel reference run.
Unified memory can definitely be a significant factor in processor performance. Apple designed their memory subsystem to be fast, and a fast processor is useless if it can't retrieve data quickly. Why else do you think processors contain cache? To expedite the retrieval of data. The same goes for main memory: the faster the memory subsystem, the faster a processor can process data.
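To make that concrete, here's a rough C sketch (not a rigorous benchmark, nothing Apple-specific, and the array size and names are arbitrary) of how badly a core stalls when nearly every access misses cache versus when the prefetcher can keep it fed. The gap between the two loops is almost entirely memory latency, which is exactly what a faster memory subsystem shrinks:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                      /* ~16M ints, much larger than any cache */

int main(void)
{
    int *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: a random permutation that forms one big cycle,
       so the pointer chase below really does wander all over memory. */
    for (int i = 0; i < N; i++) next[i] = i;
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % i;
        int t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    long long sum = 0;
    for (int i = 0; i < N; i++)          /* sequential walk: prefetch-friendly */
        sum += next[i];
    clock_t t1 = clock();

    int p = 0;
    for (int i = 0; i < N; i++)          /* dependent random walk: mostly misses */
        p = next[p];
    clock_t t2 = clock();

    printf("sequential: %.2fs   pointer chase: %.2fs   (%lld %d)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum, p);
    free(next);
    return 0;
}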
 
6502 was CISC, not RISC.

Also, all RISC processors have multiply instructions, and they all take multiple cycles to execute, just like on CISC machines.

The concept of RISC is not what you say it is.
Irrelevant. I used the 6502 to illustrate a point; I never claimed it to be a RISC processor. Please do try and follow along.
 
Your example is a bit weird. Arm has a MUL instruction.
I never claimed ARM lacks a multiplication instruction. I used multiplication as an illustration, and I used the 6502 because it lacks a multiplication instruction. Geez people, try and follow along.

RISC is less "RISCy" than it used to be. RISC, at least in instruction set size, has moved in the direction of CISC.
 
The problem is: They don't. Most are shared between the two architectures. Orthogonal instructions make the task of decoding instructions easier but instruction decoding is such a small part of a CPU die that it's not much of an issue any more.

There you go. You're welcome.
So, thanks for nothing?
 
So, thanks for nothing?
Don't blame me if you're unable to follow the discussion. Amazing how many Apple fans seem to have an emotional attachment to RISC. Just like they did back in the PPC days, right up until Apple transitioned to Intel.
 
AMD has (I believe) been doing SoC since at least bulldozer. I am quoted about it here:


Yet they can match M1 only by running the clock at 166% of M1’s.

Interesting. So would I be correct in saying that the x86 (CISC) architecture does not realize the same benefits from the SoC approach as RISC does?
 
Interesting. So would I be correct in saying that the x86 (CISC) architecture does not realize the same benefits from the SoC approach as RISC does?

I don't think there's a hard and fast rule. It depends on the software stack, what you put on the SoC, how you connect the things on the SoC, etc.

Look at it this way. I once worked at a company that was designing a CPU that was both a RISC processor and a CISC processor. It could run either of two different instruction sets, so the same computing hardware was used for everything. But to support CISC, there was an unavoidable penalty. You have to deal with more complex decoding, which means there is a bigger penalty when you miss a branch prediction, which means you have to increase the size of the branch prediction unit to compensate. Then you have to add hardware to cope with various weird CISC memory addressing modes, which means you are complicating the memory hierarchy subsystem in ways that waste more power when you have a cache miss or have to page to disk. Which means you probably couldn't do a unified memory architecture in a reasonable way, etc.

In other words, even if I use the exact same micro architecture, CISC will be worse than RISC. In practice, I found it was about 20% performance/power, but times have changed and the difference is apparently much bigger now (because now, for example, a bigger transistor budget means you can have many more parallel pipelines, which is much harder to do in CISC, but was not something we tried back in the day.)
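A toy way to see the decode part of that penalty (this is just my sketch, not any real ISA or decoder, and length_of is a hypothetical helper): with fixed-length instructions the start of the k-th instruction is a constant-time calculation, so a wide front end can peel off many instructions per cycle in parallel; with variable-length encoding each instruction's start depends on the length of every instruction before it, which is an inherently serial walk unless you throw extra speculative hardware at it.

Code:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "decoder": pretend the top two bits of the first byte
   encode the instruction size (1 to 4 bytes). Not any real encoding. */
static size_t length_of(const uint8_t *insn)
{
    return (size_t)(insn[0] >> 6) + 1;
}

/* Fixed-length (RISC-style): offset of instruction k is simply 4*k,
   independent of the stream, so many instructions can be located at once. */
static size_t fixed_start(size_t k)
{
    return 4 * k;
}

/* Variable-length (CISC-style): offset of instruction k requires walking
   the lengths of instructions 0..k-1, a serial dependency chain. */
static size_t variable_start(const uint8_t *code, size_t k)
{
    size_t offset = 0;
    for (size_t i = 0; i < k; i++)
        offset += length_of(code + offset);
    return offset;
}

int main(void)
{
    uint8_t code[64] = { 0x00, 0x40, 0x80, 0xC0, 0x00, 0x40 }; /* fake stream */
    printf("fixed:    insn 3 starts at byte %zu\n", fixed_start(3));
    printf("variable: insn 3 starts at byte %zu\n", variable_start(code, 3));
    return 0;
}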
 
  • Like
Reactions: cgsnipinva
Nothing like comparing a 2011 architecture to one from 2020.
I was not comparing anything. I was pointing out that “switching to SoC design methodology” does not magically make things better (in fact, in the article, i said it makes things worse). And, of course, it’s certain AMD is *still* doing SoC.

Which is why I pointed out that they need to run at 166% of M1’s clock to keep up - I was comparing a *2020* architecture to a *2020* architecture.
 
  • Like
Reactions: cgsnipinva
I was not comparing anything. I was pointing out that “switching to SoC design methodology” does not magically make things better (in fact, in the article, i said it makes things worse). And, of course, it’s certain AMD is *still* doing SoC.

Which is why I pointed out that they need to run at 166% of M1’s clock to keep up - I was comparing a *2020* architecture to a *2020* architecture.
By comparing a 2011 architecture to a 2020 architecture.
 