Seems to me there are two things related to RAM going on here. UMA and in-package RAM. People use “UMA” to refer to both, but I think of them as two different issues.

With respect to the in-package RAM, it is true that the advantage there comes when you have an L2 cache miss. You add around 6 ps/mm of latency when your signals have to travel to distant RAM (plus you have to use bigger drivers, which take more power and get hotter). You typically get around that by using more cache (or more cache levels). This smart guy has something to say about all this: https://www.ecse.rpi.edu/frisc/theses/MaierThesis/
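To put rough numbers on that (the trace lengths and clock below are just assumptions for the arithmetic, not measurements of any real board):

Code:
#include <stdio.h>

/* Back-of-envelope: extra one-way wire delay for signals that have to
   leave the package, assuming ~6 ps of delay per mm of trace. */
int main(void) {
    const double ps_per_mm     = 6.0;   /* assumed propagation delay        */
    const double in_package_mm = 10.0;  /* hypothetical in-package distance */
    const double on_board_mm   = 60.0;  /* hypothetical distance to a DIMM  */
    const double clock_ghz     = 3.2;   /* roughly the Firestorm clock      */

    double extra_ps = (on_board_mm - in_package_mm) * ps_per_mm;
    double cycle_ps = 1000.0 / clock_ghz;  /* one clock period in ps */

    printf("extra one-way delay: %.0f ps (~%.1f clock cycles)\n",
           extra_ps, extra_ps / cycle_ps);
    return 0;
}

On those made-up distances it works out to roughly an extra cycle each way, on top of the driver power mentioned above.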

The other advantage of UMA is that it avoids time- (and power-) consuming memory transfers. If you have the CPU calculating information that the GPU needs to see, it can just write it into the shared memory, and there is no need to copy it from CPU memory to GPU memory over a bus.
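A toy sketch of the difference (the pcie_copy helper and the "GPU memory" buffer below are made-up stand-ins for a real driver API, purely for illustration):

Code:
#include <stdlib.h>
#include <string.h>

/* Stand-in for a DMA transfer across the bus to a discrete GPU's memory. */
static void pcie_copy(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}

#define N (1 << 20)

int main(void) {
    float *cpu_data = malloc(N * sizeof *cpu_data);
    for (size_t i = 0; i < N; i++) cpu_data[i] = (float)i;

    /* Discrete GPU: the GPU has its own memory, so the CPU's result has
       to be copied across the bus before the GPU can see it. */
    float *gpu_copy = malloc(N * sizeof *gpu_copy);  /* "GPU memory" stand-in */
    pcie_copy(gpu_copy, cpu_data, N * sizeof *cpu_data);

    /* Unified memory: CPU and GPU address the same physical pages, so the
       hand-off is just passing the pointer (plus whatever cache maintenance
       the hardware needs) - no bulk copy at all. */
    float *shared_view = cpu_data;
    (void)shared_view;

    free(gpu_copy);
    free(cpu_data);
    return 0;
}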

Note that the information the GPU and CPU share may never even make it into the RAM in the package - it may be entirely within the caches (depending on how much information there is and how much other stuff is going on).

Yeah, that's why the first time I tried to explain it, I referred to it as "in package memory"-- but the distinction seemed lost on the audience. Being in package, I believe, would have only a minor impact on the tight loops and minimal data sets in the benchmarks I referenced. Based on the Geekbench reference data from the i5-6400, there are too few misses to make that kind of difference. The i9 and Xeon have significantly more cache than the i5-6400, so would suffer even fewer misses.

I can't see how the unified memory architecture itself would have any impact on the single core CPU benchmark. Best I can tell, the benchmarks are based on external libraries (not Apple's APIs) and aren't making use of the coprocessors. There's a separate Geekbench Compute Workload that would exercise the GPU and the rest of the system.

From what I can tell, the 50% performance advantage the Firestorm core has over the i9 and Xeon is mostly due to the CPU itself.
 

Seems likely. If, as reported, it can issue ~8 instructions per cycle at peak, and given the relatively short pipeline and low branch prediction miss penalty, you can get much of that 50% right there.
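A rough back-of-envelope, treating the clock speeds as approximations (Firestorm peaks around 3.2 GHz, the i9's single-core boost around 5 GHz), shows how much per-cycle work that implies:

Code:
#include <stdio.h>

int main(void) {
    /* Rough clock figures, treated as assumptions for the arithmetic. */
    const double m1_ghz  = 3.2;   /* reported Firestorm peak clock        */
    const double i9_ghz  = 5.0;   /* i9 single-core boost, approximately  */
    const double speedup = 1.5;   /* ~50% benchmark advantage             */

    /* Work per cycle the M1 needs relative to the i9 to be ~50% faster
       despite the lower clock. */
    double ipc_ratio = speedup * i9_ghz / m1_ghz;
    printf("required per-cycle throughput ratio: %.2f\n", ipc_ratio);  /* ~2.3x */
    return 0;
}

In other words, at those clocks the M1 needs well over 2x the per-cycle throughput, which is consistent with the very wide issue width.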
 
The GB documentation says that the SQLite subtest "measures the transaction rate a device can sustain with an in-memory SQL database." Would this measure the effect of in-package memory?

The overall single-core GB score of the M1 is ~20% greater than that of the i7-1165G7. On that specific subtest, from a quick scan of several single-core results for each processor, it appears the M1 scores ~10% better.
 

Whether it tests the effect of in-package memory latency would depend on the memory access pattern. If the memory accesses are such that the latency doesn’t matter (because the benchmark can keep busy with other things), then you wouldn’t see much effect. Similarly, if the memory accesses are fairly linear, then the cache fill algorithm might fetch data ahead of when it is needed, depending on the proximity of accesses. Hard to tell what’s going on just from these sorts of benchmarks.
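A minimal sketch of what I mean by access pattern, under the assumption of an array far bigger than the caches; timing the two loops separately is one way to see whether DRAM latency is actually exposed:

Code:
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   /* 16M ints, assumed to be far bigger than the caches */

int main(void) {
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++) a[i] = i;

    long long sum = 0;

    /* Linear walk: the hardware prefetcher can run ahead of the loads,
       so most of the memory latency is hidden. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Pseudo-random walk: the prefetcher can't guess the next address,
       so each miss has to wait out the full latency. */
    unsigned idx = 1;
    for (int i = 0; i < N; i++) {
        idx = idx * 1664525u + 1013904223u;   /* LCG, just to scatter accesses */
        sum += a[idx % N];
    }

    printf("%lld\n", sum);
    free(a);
    return 0;
}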
 
Page 25 says 0 L3 cache misses for the SQLite test. I can’t imagine it’s worse on the i9 or Xeon. I’m still referring back to the AppleInsider chart I posted earlier for the 50% number (the M1 looks roughly 50% faster than the i9 16” MBP and the 2019 Mac Pro).
 
When did I say that? What I said was that IBM was not really focusing on the needs of Apple with regard to the PowerPC chip, because their focus was more on the corporate market (servers, high-end workstations, data centers, etc.) than on the consumer market. As a result, the development priorities and schedule did not align with Apple's.

That was the primary cause for Apple's transition to Intel. However, at the time even Apple indicated this would be a transition period and not a permanent move.

And they were correct.

POWER ran on the server side; PowerPC was the low-end line. They ran AIX and Linux, depending.

excuse my ignorance
1) I forgot IBM's main business was building servers for enterprise
2) I didn't know they built different CPUs for servers; I thought server CPUs were boosted PC CPUs
3) I am surprised IBM still makes CPUs; I thought they quit after they left the consumer market. Surely building CPUs for consumers is more profitable than just servers (unless a server CPU costs like $10K each or something)
4) I thought servers were mainly running Intel and AMD by now
5)
cgsnipinva said:
What apple wanted to do with the processors supporting their products was vastly different than what IBM/Motorola had in mind

I thought the IBM/Motorola partnership was on the PowerPC CPU.
 
Wow! 6502... that brings back memories! I first learned BASIC when I was 12. Then taught myself how to program 6502 on my Apple II, because BASIC was way too slow to do anything graphical.

/nostalgia
I learned BASIC on a TRS-80 Model I in a special small class during third grade that met an hour before school started, on what was the first computer in our school district. Never got much beyond PEEK and POKE on the Apple II, though it became the main machine of my childhood (at school; the first computer in our house was a TI-99/4A). I coded the TI a lot more than the Apples. I did have to learn 6502 assembler in college, as part of one of the lab classes. We’d have to design various interface boards and wire-wrap things to do silly things like use an oscilloscope as a vector display, etc.
 
Memory constraints. CISC's benefit is reduced code size. Let's take multiplication as an example. It consumes less memory to use a single multiplication instruction than to construct that same multiplication using additions and shifts. Also, it's likely to be faster, as it's executed in "hardware".

To illustrate using multiplication, here is a multiplication routine written in 6502 assembly language (the 6502 does not have a multiplication instruction):

Code:
;32 bit multiply with 64 bit product

MULTIPLY:  lda     #$00
           sta     PROD+4   ;Clear upper half of
           sta     PROD+5   ;product
           sta     PROD+6
           sta     PROD+7
           ldx     #$20     ;Set binary count to 32
SHIFT_R:   lsr     MULR+3   ;Shift multiplier right
           ror     MULR+2
           ror     MULR+1
           ror     MULR
           bcc     ROTATE_R ;Go rotate right if c = 0
           lda     PROD+4   ;Get upper half of product
           clc              ; and add multiplicand to
           adc     MULND    ; it
           sta     PROD+4
           lda     PROD+5
           adc     MULND+1
           sta     PROD+5
           lda     PROD+6
           adc     MULND+2
           sta     PROD+6
           lda     PROD+7
           adc     MULND+3
ROTATE_R:  ror     a        ;Rotate partial product
           sta     PROD+7   ; right
           ror     PROD+6
           ror     PROD+5
           ror     PROD+4
           ror     PROD+3
           ror     PROD+2
           ror     PROD+1
           ror     PROD
           dex              ;Decrement bit count and
           bne     SHIFT_R  ; loop until 32 bits are
           clc              ; done
           lda     MULXP1   ;Add dps and put sum in MULXP2
           adc     MULXP2
           sta     MULXP2
           rts

Using CISC, assuming the processor has a multiplication instruction, one could merely write:

Code:
           mul     R1, R2      ; Multiply the values of R1 and R2
Note: The CISC instruction is hypothetical and is intended for illustrative purposes only.

As you can see, the CISC code is much, much smaller. As for execution time, it depends on how quickly a processor can execute all of the instructions used to build the multiplication routine versus the time taken to execute a single multiplication instruction.

The concept of RISC is that each instruction can be executed very quickly, and therefore the total time to execute all of them would be less than that of a single multiplication instruction. For CISC, the idea is that the built-in MUL op code would execute faster than a bunch of smaller op codes. However, not all processor instructions are actually hardwired. Some are built upon microcode, which is, for this discussion, a small program within the processor itself.
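For comparison, here is the same shift-and-add idea in C. It is purely illustrative (a real compiler would just emit the hardware multiply instruction), but it shows the decomposition the 6502 routine above is doing:

Code:
#include <stdint.h>
#include <stdio.h>

/* 32x32 -> 64-bit multiply built only from adds, shifts, and tests. */
static uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t addend  = multiplicand;
    for (int bit = 0; bit < 32; bit++) {
        if (multiplier & 1u)
            product += addend;     /* add when the current bit is set    */
        multiplier >>= 1;          /* move to the next multiplier bit    */
        addend    <<= 1;           /* multiplicand shifted into position */
    }
    return product;
}

int main(void) {
    printf("%llu\n", (unsigned long long)shift_add_mul(123456789u, 987654321u));
    return 0;
}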

Once again the 80/20 rule can be invoked. RISC designers found that applications primarily (80%) consist of a small number of frequently used instructions and optimized for them. The other 20% could, as the multiplication example above shows, be built from those instructions.

HTH

so... they didn't use RISC because it required a lot of memory, and now that there is a lot of memory and it's cheap, RISC is better?
 
The 6502 and other processors of that vintage didn't use RISC simply because fitting any kind of modern ISA into fewer than 4,000 transistors and no more than 40 slow I/O pins is difficult enough. Generic RISC ISAs require lots of registers, and either a cache or a wide, fast memory bus. Moore's Law didn't provide enough transistors until roughly a decade later.

A decade later was too late for the 8086 architects, but just in time for Berkeley RISC, Stanford MIPS, HP PA-RISC, SPARC, AMD 29000, Motorola 88000, and the Acorn RISC Machine (ARM), et al., to do a modern ISA that wasn't stuck with a bunch of ugly hacks needed to wedge a messy CISC ISA implementation into a much smaller number of transistors.
 

It was also an era where you needed to do very un-RISC-like things just to get things to work. So you would do bits and pieces of the instruction decoding in various places, before and after latches. You’d copy signals to avoid having to copy logic. People would build instruction op codes (assigning them binary values) with the idea that you could choose carefully in order to avoid having to add a gate to figure things out. The problem with all of this is you were assuming there would never be a new chip, with more capability, that needed to use the same ISA :)
 
Since Apple has not seen fit to disclose exactly how the Firestorm microarchitecture is set up, we are left using tests and other tools to make guesses. This is why, upthread, I noted AnandTech has an edge, in that Anand Lal Shimpi is part of the Apple Silicon team. Based on their (and others') work, we think Firestorm is a VERY wide but short pipe. It seems to have 8 decoders with at least 6 ALUs and a huge ROB. So it is massively parallel and seems to be able to issue at least 8 instructions per cycle consistently.

@cmaier would likely know better than I, but is it possible that the advantage of RISC here is that the micro-ops being used (it has apparently been established that Apple Silicon, unlike a lot of ARM designs, uses micro-ops) are consistent in size, which allows even faster allocation to the pipes and thus faster processing?

This thing actually reminds me a little of the old Intel Banias and Dothan architectures, not to mention Conroe - all were short, fat pipes with out-of-order execution and really good branch predictors.
 
The term “microOp” has flexible meaning. It’s often used to refer to microcode ops, and also to refer to statelessly-decoded ISA ops (in other words, just a different version of the ISA op with more bits representing more detailed control signals). Everyone decodes ISA ops into micro-ops, including all other Arm vendors. Unless there is some suggestion that Apple is using microcode (I’ve seen no such suggestion), I don’t think there’s much magical about the micro-ops.

That said, the advantage Apple has over CISC is that there is a fixed correspondence between micro-ops and ISA instructions, most instructions don’t access memory, they have big register files, and they have a short pipeline. This all makes it possible to have a big reorder buffer that presumably takes only a single pipe stage to determine what ops to issue.

Their advantage over other Arm vendors *appears* to be more pipelines in parallel - specifically more ALU pipelines. They’ve also adopted a heterogeneous approach to those, putting multipliers only in some of them (a very common technique) and apparently having a separate pipeline just for divide (not so common a technique).

The problem with more parallel ALUs is that it is tough to keep them busy, because of instruction interdependencies. Apple has a good idea what the ideal amount is, though, because it is so intimately familiar with the entire software stack, so presumably they hit the sweet spot.
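A small sketch of that interdependency problem, with arbitrary constants: both loops below do the same number of operations, but only the second gives a wide core independent chains it can spread across its ALUs (you'd need to time them with optimization settings that don't fold the loops away to see the difference):

Code:
#include <stdio.h>

#define N 100000000

int main(void) {
    volatile unsigned long sink;

    /* One long dependency chain: every operation needs the previous
       result, so extra ALUs mostly sit idle. */
    unsigned long a = 1;
    for (long i = 0; i < N; i++)
        a = a * 3 + (unsigned long)i;
    sink = a;

    /* Four independent chains: an out-of-order core can issue these
       in parallel across its ALUs. */
    unsigned long s0 = 1, s1 = 1, s2 = 1, s3 = 1;
    for (long i = 0; i < N; i += 4) {
        s0 = s0 * 3 + (unsigned long)i;
        s1 = s1 * 3 + (unsigned long)(i + 1);
        s2 = s2 * 3 + (unsigned long)(i + 2);
        s3 = s3 * 3 + (unsigned long)(i + 3);
    }
    sink = s0 + s1 + s2 + s3;

    printf("%lu\n", sink);
    return 0;
}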
 
I like this explanation. Very clear. Thank you for that. It's why I suggested earlier that it's such an advantage to own the software and hardware stack, versus Intel and Windows. Wintel is so disconnected, generic, and pulled by different goals that it's no wonder it isn't optimal at this point. Add macOS to the mix and even more so.

We've had this RISC vs CISC argument going on in the thread over nuances that I believe are irrelevant. I'd suggest focusing instead on the actual advantages of the M1 over lower-powered Intel or AMD chips in the same form factors.

Given the nonsense around the thread here and there, I enjoyed the discussion (especially on Power as I worked on that team.)

Personally I'm excited by the design and have wanted an ARM system to work with. I have a few ARM SoC devices I've worked with and like them, even if they're limited. This is the first time I've seen something this well thought out and aimed specifically at desktop/laptop use. Exciting times.
 
It was also an era where you needed to do very un-RISC-like things just to get things to work. So you would do bits and pieces of the instruction decoding in various places, before and after latches. You’d copy signals to avoid having to copy logic. People would build instruction op codes (assigning them binary values) with the idea that you could choose carefully in order to avoid having to add a gate to figure things out. The problem with all of this is you were assuming there would never be a new chip, with more capability, that needed to use the same ISA :)
I think this is the bit people miss... CISC is a concept defined by its alternative and grafted onto history.

At the time, no one was pursuing a "CISC philosophy", they were simply making the most of the resources they had at the time. Strategy didn't eclipse tactics until there were enough resources and experience available to rethink the architecture and support it with machine compilation.
 
To repeat myself, the unified memory will help in the event of a cache miss, but these are quite rare in Geekbench workloads (0.2%). You'll have to convince me the difference in memory latency and throughput makes the 50% difference in that particular benchmark.
The performance difference due to the M1 memory system is likely for a very different reason than just reduced cache miss latency (although that does help the performance of apps with really large memory footprints).

The difference is that any extra IO pads/pins needed to drive an off-die bus to a separate GPU memory require a ton of power at high enough bandwidth. The M1's shared in-package memory buses can use far less power than that.

Which means the M1 can use that non-wasted IO pin power budget for the CPU cores instead (running more cores faster, etc.), and still stay within the thermal envelope and provide longer battery life.
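A back-of-envelope version of that argument; every number below is a placeholder assumption rather than a measured figure for the M1 or any discrete GPU, but it shows why energy per bit on the memory interface matters at tens of GB/s:

Code:
#include <stdio.h>

int main(void) {
    /* All numbers here are placeholder assumptions for the arithmetic,
       not measured figures for any specific part. */
    const double bandwidth_gbs  = 60.0;  /* sustained traffic, GB/s      */
    const double off_pkg_pj_bit = 20.0;  /* long board traces + big pads */
    const double in_pkg_pj_bit  = 5.0;   /* short in-package wires       */

    double bits_per_s = bandwidth_gbs * 8e9;
    double off_w = bits_per_s * off_pkg_pj_bit * 1e-12;
    double in_w  = bits_per_s * in_pkg_pj_bit  * 1e-12;

    printf("off-package: %.1f W, in-package: %.1f W, saved: %.1f W\n",
           off_w, in_w, off_w - in_w);
    return 0;
}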
 
While the premise is certainly true that in-package RAM reduces overall power consumption, I’m not convinced it explains the benchmarking difference, for a few reasons: the buffers occupy a different area of silicon from the processing cores, so they don’t heat the cores very efficiently; the RAM is now in the same package sharing the same heat sink, so if heat in different parts of the package were a concern, I’m not sure there’s a net reduction in power within the package; and if there are very few cache misses, there isn’t much traffic to generate heat in any event...

Interesting thought, though. I still think the CPU design is the major reason for the performance difference.
 
This leaves more simplicity for buying, and also closes the gap on "which machine is faster"... That's not really a question anymore when it's always 8-core or 7-core...

It's not like there is a massive buying decision. This is probably also why Apple didn't really deem CPU upgrade options necessary: you have a fast CPU anyway, so upgrade to what? No apps would take advantage of it even if one were available.
 
Which engine do you use that runs under macOS/M1 as well as under PC? With Rosetta 2?
I use Stockfish, which most consider the strongest. You can compile it as native under M1, and initial benchmarks say it's very, very fast for a thin-and-light laptop, but it's nowhere near a recent high-end Intel desktop, and even an M1X with 12 cores won't get there based on the M1's results.

I do still think it's fantastic, and I am debating whether to buy one or wait for a MacBook Pro 14 or 16. I just felt the need to temper the people who are saying Intel is obsolete! That is just not the case if you do data crunching or any application of that kind. Maybe in the future, but not now. The fact that Apple has put a lot of transistors toward video editing, and that most reviewers swear by that, has skewed the reviews.
 
 
I really need someone to do a benchmark with After Effects on the M1 compared to an i7 pleaseeeeeeeeee.... lol. That will be my honest "true" teller of Apple Silicon's performance. Of course... we have to wait for Apple Silicon versions of the apps to be available.
 
The 6502 and other processors of that vintage didn't use RISC simply because fitting any kind of modern ISA into fewer than 4,000 transistors and no more than 40 slow I/O pins is difficult enough. Generic RISC ISAs require lots of registers, and either a cache or a wide, fast memory bus. Moore's Law didn't provide enough transistors until roughly a decade later.

A decade later was too late for the 8086 architects, but just in time for Berkeley RISC, Stanford MIPS, HP PA-RISC, SPARC, AMD 29000, Motorola 88000, and the Acorn RISC Machine (ARM), et al., to do a modern ISA that wasn't stuck with a bunch of ugly hacks needed to wedge a messy CISC ISA implementation into a much smaller number of transistors.

It's interesting and scary how, once a standard is made, it's so hard to abandon. It reminds me of US 120V electricity: it was first installed by Edison, then Tesla came along with 220V, which is better, but the US decided it was already too deep into it, so a hundred years later the US is still on 120V. At least, that is what I understood from it.

But in my opinion, all it really takes is a courageous decision maker that is not afraid, like Apple jumping to Intel and then jumping back to RISC using their own CPUs. As their "Think Different" ad campaign goes, "Here's to the crazy ones...!"
 