Then what is your point?

SPARC is RISC. Isn't RISC all that matters?

If you read the source article, no - it's the implementation approach of RISC, ARM, and SoC designs that made the difference. RISC just offers more flexibility. Now that clock speeds cannot be ramped up at will, the weaknesses in the CISC architecture are becoming more apparent.
 
My understanding was that CISC was chosen because it's faster: it can run multiple operations at one time, while RISC could run just one operation at a time. RISC was used mainly on appliance-like devices, largely due to its low power usage. I mean, even on the *nix side of things they mainly use AMD and Intel, which are CISC, no Windows needed there. It doesn't help that Apple abandoned IBM PowerPC (RISC) and switched architectures to Intel (CISC) just because PowerPC was not delivering as much performance.

RISC can absolutely run more than one operation at a time. In fact, it was RISC processors that pioneered superscalar functionality. It's also far easier to do parallelism with RISC - that's something addressed in the article linked to in the original post in this thread. This is because there's a 1:1 mapping of micro ops to ISA instructions in RISC, unlike in CISC, and RISC instructions are almost always decoded in a single pipe stage. So when you create your reorder buffer/reservation stations/superscalar mapper, it's far easier to keep track of things: when a branch is mispredicted and you need to unwind, or a cache miss occurs, or whatever, you don't have to figure out which set of micro ops corresponds to a single ISA instruction. It makes life easier in a lot of ways, and also means it's much faster to get instructions *into* the reorder buffer/reservation stations. In CISC you can be limited by the bandwidth getting data into the buffer. In RISC you are more typically limited by the time it takes to identify dependencies (i.e., that one instruction depends on the result of another, so they have to issue in a certain order).
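To make that bookkeeping point concrete, here is a rough Python sketch (the reorder-buffer entries and instruction tags are invented for illustration, not how any real core stores them). With a 1:1 mapping each entry is its own ISA instruction; with a 1:many mapping, a squash after a mispredict has to keep or discard whole groups of micro ops belonging to one instruction:

# Each reorder-buffer entry is tagged with the ISA instruction it came from.
# RISC-like: one micro op per ISA instruction.
risc_rob = [(0, "uop"), (1, "uop"), (2, "uop")]
# CISC-like: instruction 0 cracked into three micro ops, instruction 1 into one.
cisc_rob = [(0, "uop_a"), (0, "uop_b"), (0, "uop_c"), (1, "uop_a")]

def squash_after(rob, last_good_instruction):
    # On a mispredict, discard every entry younger than the last good ISA
    # instruction; the tag is what lets us keep micro-op groups intact.
    return [(tag, uop) for tag, uop in rob if tag <= last_good_instruction]

print(squash_after(risc_rob, 0))  # one entry survives: trivial bookkeeping
print(squash_after(cisc_rob, 0))  # all three micro ops of instruction 0 survive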
 
Whaaa??? You mean there were other factors? I wish I had known that there are other factors at play besides RISC versus CISC.

If only I had known we could have avoided all of this. Oh, wait.

So, IBM was only interested in high-performance variants that consumed more and more power? Really? You mean they wouldn't be interested in developing higher-performance processors that used equal or less power? They were solely focused on higher-performance at more and more power?

Motorola lost interest in capturing a market with higher-performing processor designs? Seems like a very poor business decision to ignore such a huge market.

I worked at IBM and yes - that is an accurate statement. IBM did not want to play in the consumer desktop/laptop market at that time. They were more interested in supporting data centers and the corporate server market. What Apple wanted to do with the processors supporting their products was vastly different from what IBM/Motorola had in mind.
 
RISC can absolutely run more than one operation at a time. In fact, it was RISC processors that pioneered superscalar functionality. It's also far easier to do parallelism with RISC - that's something addressed in the article linked to in the original post in this thread. This is because there's a 1:1 mapping of micro ops to ISA instructions in RISC, unlike in CISC, and RISC instructions are almost always decoded in a single pipe stage. So when you create your reorder buffer/reservation stations/superscalar mapper, it's far easier to keep track of things: when a branch is mispredicted and you need to unwind, or a cache miss occurs, or whatever, you don't have to figure out which set of micro ops corresponds to a single ISA instruction. It makes life easier in a lot of ways, and also means it's much faster to get instructions *into* the reorder buffer/reservation stations. In CISC you can be limited by the bandwidth getting data into the buffer. In RISC you are more typically limited by the time it takes to identify dependencies (i.e., that one instruction depends on the result of another, so they have to issue in a certain order).
Which is why Apple created its monster architecture with 8 decoders and a giant reorder buffer, and unlike most ARM designs it uses micro-ops - just not in the way x86 does?
 
The Mac is not having a hard time surviving. Apple sold $9 billion worth of Macs in the last quarter alone. Does that sound like a "struggling" product? The Mac user base is currently at 140 million and continues to grow, as half of the Macs sold today go to people new to the Mac.

The fact is, Apple only makes a handful of computers in a few form factors at the higher end of the market. This naturally limits broader appeal.

The transition to Apple Silicon will enhance the entire Apple ecosystem. Software developers can write for one platform that spans a large market supported by the iPhone, iPad, Mac, and even Apple TV. That means developers targeting Apple Silicon will be writing for a much larger market than just the Mac, and the Mac will benefit along with the iPhone, iPad, and Apple TV.
 
Parallelism isn’t a RISC/CISC issue.

You are correct. I fubared on that one. I read your subsequent post responding to another poster and realized I was incorrect.

I always thought RISC just crunched through single instructions at a really fast rate - can you describe how a RISC chip (ARM) runs instructions in parallel? I would find that interesting.

If there is a link to something that would save you time - that would work as well.
 
It actually is a system on chip. The term is used to refer to a design methodology where blocks are independently designed and communicate on a bus rather than via specialized connections. And each of the circuits mentioned (cpu, gpu, neural engines, secure store, memory controller, etc) are indeed on the same silicon chip. The only thing that’s on a separate chip in the package is the RAM.

SoC is used to differentiate from “ASIC,” which is a methodology where each of those blocks would have its own unique interface, and changes made to one block would have effects on the others.
Given this, any guesses how Apple could scale this design into a modular/upgradeable Mac Pro?

As I'm sure you know, Apple publicly recognized that, to serve the pro market, they need their pro desktop offering to be modular, resulting in the latest Mac Pro. One of the big complaints among pros about the previous trashcan model was that GPU design progressed rapidly, and that model didn't allow pros to upgrade their GPUs to keep up.

But a key to Apple's new design seems to be extensive CPU-GPU integration, thus I don't see how they could allow for GPU-only upgrades without giving that up. Perhaps Apple's answer to pros' modularity requirements would be to allow the addition of new CPU-GPU SoC modules (so you couldn't buy just more GPU power alone). Those modules would then have to be linked to each other. The inter-module communication speed would be slower than intra-module, thus there would need to be sophisticated traffic control to distribute the computational load in a way that minimizes total processing time (i.e., keep as much as possible within each module, passing between modules only as necessary) - just as is, I assume, currently done for different nodes in supercomputers.

I suppose they might be able to offer supplementary CPU-GPU modules that are especially GPU-heavy.

One thing I don't know is whether the sorts of multi-threaded applications used by Pros can distribute their threads across multiple separate processors, or only across multiple cores in the same processor.

Or perhaps they will pursue a hybrid approach for the pros -- a fully integrated central CPU-GPU chip, and then separate GPU-only modules that, while they wouldn't offer the CPU-GPU integration benefits of the central chip, would provide significant added GPU processing power.
 
Which is why Apple created its monster architecture with 8 decoders and a giant reorder buffer, and unlike most ARM designs it uses micro-ops - just not in the way x86 does?
Micro-ops are not the same as microcode, which is, I think, confusing people. A micro op is just the internal representation of the instruction. The op code, for example, may be broken into multiple additional signals so that downstream stages don’t need to go to the trouble of doing it. Think of micro ops, essentially, as just a different version of the instruction. x86, by contrast, often maps a single instruction into a large set of microcode instructions that need to be performed in a particular sequence. And to figure that out, the decoder has to keep guessing whether a particular set of bits is the start of one instruction or a different instruction. So what we often do is proceed as if it’s BOTH, and then throw away the work we did for the other. And since instructions may not be aligned to memory lines, you have to keep fetching things and then glue them together, which is part of the reason you need a state machine (am I looking for the start of an instruction? Am I looking to see if I need to keep fetching parts of the instruction? Am I looking for the end of an instruction?).
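A toy Python sketch of why that state machine exists (the encodings here are invented, not real ARM or x86): with fixed-width instructions every start position is known up front, so many decoders can work in parallel; with variable-length instructions you can't know where instruction N+1 starts until you have at least partly decoded instruction N:

def decode_fixed(stream, width=4):
    # Every instruction starts at a multiple of `width`, so all start
    # positions are known immediately and can be handed to parallel decoders.
    return [stream[i:i + width] for i in range(0, len(stream), width)]

def decode_variable(stream):
    # Pretend the first byte of each instruction gives its total length
    # (purely hypothetical). The walk is inherently sequential: each start
    # position depends on having decoded everything before it.
    out, i = [], 0
    while i < len(stream):
        length = stream[i]
        out.append(stream[i:i + length])
        i += length
    return out

print(decode_fixed(bytes(range(8))))                        # two 4-byte instructions
print(decode_variable(bytes([3, 1, 2, 2, 9, 4, 0, 0, 0])))  # lengths 3, 2, 4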
 
I don’t know cmaier from jack but it’s obvious that he has designed chips and worked with both x86 and RISC designs, something other posters here on MacRumors have verified. What are your credentials?
Two of which are dead, one of which is nearly dead, and the final one of which (according to him) is a poor design. It doesn't appear to me that this is something to brag about.
 
I worked at IBM and yes - that is an accurate statement. IBM did not want to play in the consumer desktop/laptop market at that time. They were more interested in supporting data centers and the corporate server market. What Apple wanted to do with the processors supporting their products was vastly different from what IBM/Motorola had in mind.
So IBM's goal was to create processors that consumed more and more power? Seems like an odd decision. Oh, I've worked for a number of different companies. Having been an employee of these companies doesn't make me an authority on their strategy.
 
You are correct. I fubared on that one. I read your subsequent post responding to another poster and realized I was incorrect.

I always thought RISC just crunched through single instructions at a really fast rate - can you describe how a RISC chip (ARM) runs instructions in parallel? I would find that interesting.

If there is a link to something that would save you time - that would work as well.

It’s actually pretty simple. Each core has multiple ALUs, each of which has an adder, shifter, multiplier (depending on the microarchitecture), etc.

So if the instruction stream looks like this:

ADD R0, R1 -> R2
ADD R3, R4 -> R5
ADD R2, R5 -> R6

Then you build a little scoreboard. The first instruction depends on R0 and R1 and creates R2.
The next instruction does not depend on R2, so it can be issued at the same time as the first instruction.
The last instruction depends on R2 and R5. Since instructions that came before the last instruction change the values of both of these, and since they have not been issued into the pipeline yet, we cannot issue the last instruction until the first two issue. In fact, we probably can’t issue it until they have mostly completed (you don’t actually wait until the first two instructions modify R2 and R5 - it’s good enough that they have calculated the results, and then you can short-circuit them into the appropriate slots for the third instruction).

One other wrinkle is that when the instruction refers to registers like “R1” or “R2”, that doesn’t mean we have a specific register that is always “R1.” We have a large collection of registers, and we use whatever one is available. There can be multiple versions of R1 floating around, depending on the sequence of instructions. This is called “register renaming” (https://en.wikipedia.org/wiki/Register_renaming) and everyone does some version of this.

There are LOTS of ways to do this. I owned the scheduler on a SPARC design, and we used reservation stations (https://en.wikipedia.org/wiki/Reservation_station). But there are other techniques as well.
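A minimal Python sketch of that scoreboard check, using the three ADDs above (the data structures are simplified for illustration; a real scheduler also handles renaming, ready/complete bits, and result forwarding):

instructions = [
    ("ADD", ("R0", "R1"), "R2"),
    ("ADD", ("R3", "R4"), "R5"),
    ("ADD", ("R2", "R5"), "R6"),
]

in_flight = set()  # destination registers whose results are not yet available

for op, sources, dest in instructions:
    if any(src in in_flight for src in sources):
        print(f"{op} -> {dest}: wait (depends on an in-flight result)")
    else:
        print(f"{op} -> {dest}: can issue this cycle")
    in_flight.add(dest)  # the result for `dest` is now pending

# Output: the first two ADDs can issue together; the third must wait for R2 and R5.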
 
So IBM's goal was to create processors that consumed more and more power? Seems like an odd decision. Oh, I've worked for a number of different companies. Having been an employee of these companies doesn't make me an authority on their strategy.
No - you are being obtuse - a business decision was made to focus on a market segment where power efficiency was not one of the leading features. When you are on the business side of the organization and are in meetings where such discussions and decisions are made, then you can comment on this.
 
Micro-ops are not the same as microcode, which is, I think, confusing people. A micro op is just the internal representation of the instruction. The op code, for example, may be broken into multiple additional signals so that downstream stages don’t need to go to the trouble of doing it. Think of micro ops, essentially, as just a different version of the instruction. x86, by contrast, often maps a single instruction into a large set of microcode instructions that need to be performed in a particular sequence. And to figure that out, the decoder has to keep guessing whether a particular set of bits is the start of one instruction or a different instruction. So what we often do is proceed as if it’s BOTH, and then throw away the work we did for the other. And since instructions may not be aligned to memory lines, you have to keep fetching things and then glue them together, which is part of the reason you need a state machine (am I looking for the start of an instruction? Am I looking to see if I need to keep fetching parts of the instruction? Am I looking for the end of an instruction?).
So the article is incorrect in citing the use of micro ops by Apple Silicon? I did get the general point that RISC micro ops are more uniform, so that Apple, with the WIDE pipe and huge caches and buffers, is simply processing a lot more instructions at one time than the others.
 
Given this, any guesses how Apple could scale this design into a modular/upgradeable Mac Pro?

As I'm sure you know, Apple publicly recognized that, to serve the pro market, they need their pro desktop offering to be modular, resulting in the latest Mac Pro. One of the big complaints among pros about the previous trashcan model was that GPU design progressed rapidly, and that model didn't allow pros to upgrade their GPUs to keep up.

But a key to Apple's new design seems to be extensive CPU-GPU integration, thus I don't see how they could allow for GPU-only upgrades without giving that up. Perhaps Apple's answer to pros' modularity requirements would be to allow the addition of new CPU-GPU SoC modules (so you couldn't buy just more GPU power alone). Those modules would then have to be linked to each other. The inter-module communication speed would be slower than intra-module, thus there would need to be sophisticated traffic control to distribute the computational load in a way that minimizes total processing time (i.e., keep as much as possible within each module, passing between modules only as necessary) - just as is, I assume, currently done for different nodes in supercomputers.

I suppose they might be able to offer supplementary CPU-GPU modules that are especially GPU-heavy.

One thing I don't know is whether the sorts of multi-threaded applications used by Pros can distribute their threads across multiple separate processors, or only across multiple cores in the same processor.

Or perhaps they will pursue a hybrid approach for the pros -- a fully integrated central CPU-GPU chip, and then separate GPU-only modules that, while they wouldn't offer the CPU-GPU integration benefits of the central chip, would provide significant added GPU processing power.

My understanding is that Apple already has a separate GPU ready to go, and we will likely see it mid-2021 in some variation of the MBP and iMac.

The CPU and GPU can still both share RAM even if the GPU is not on the same die as the CPU. It could be that the “discrete” GPU is in the same *package* as the RAM and SoC. Or it could be in a separate package, with a high speed bus and sufficient caching to somehow make it work well.

I also assume that they will keep the on-SoC GPU in this scenario, and rely on it for power savings and extra computational oomph when necessary.

If they decide to do a Mac Pro that supports third party graphics cards, then those people will just not get the benefit of the unified memory architecture. Of course there are other benefits to using some of these cards, so everything is a trade-off.
 
If you read the source article, no - it's the implementation approach of RISC, ARM, and SoC designs that made the difference. RISC just offers more flexibility. Now that clock speeds cannot be ramped up at will, the weaknesses in the CISC architecture are becoming more apparent.
SoC design? I wish I would have figured that out - but my point was rather about Apple placing everything in one package, including specialized processing units.

Geez people, do read what people are saying instead of just dog piling on.
 
It’s actually pretty simple. Each core has multiple ALUs, each of which has an adder, shifter, multiplier (depending on the microarchitecture), etc.

So if the instruction stream looks like this:

ADD R0, R1 -> R2
ADD R3, R4 -> R5
ADD R2, R5 -> R6

Then you build a little scoreboard. The first instruction depends on R0 and R1 and creates R2.
The next instruction does not depend on R2, so it can be issued at the same time as the first instruction.
The last instruction depends on R2 and R5. Since instructions that came before the last instruction change the values of both of these, and since they have not been issued into the pipeline yet, we cannot issue the last instruction until the first two issue. In fact, we probably can’t issue it until they have mostly completed (you don’t actually wait until the first two instructions modify R2 and R5 - it’s good enough that they have calculated the results, and then you can short-circuit them into the appropriate slots for the third instruction).

One other wrinkle is that when the instruction refers to registers like “R1” or “R2”, that doesn’t mean we have a specific register that is always “R1.” We have a large collection of registers, and we use whatever one is available. There can be multiple versions of R1 floating around, depending on the sequence of instructions. This is called “register renaming” (https://en.wikipedia.org/wiki/Register_renaming) and everyone does some version of this.

There are LOTS of ways to do this. I owned the scheduler on a SPARC design, and we used reservation stations (https://en.wikipedia.org/wiki/Reservation_station). But there are other techniques as well.

OK, that makes sense. Thanks for the information - very informative.
 
No - you are being obtuse - a business decision was made to focus on a market segment where power efficiency was not one of the leading features. When you are on the business side of the organization and are in meetings where such discussions and decisions are made, then you can comment on this.
Nothing obtuse about my statement. You said they decided to focus on developing processors which consumed more and more power.
 
So the article is incorrect in citing the use of micro ops by Apple Silicon? I did get the general point that RISC micro ops are more uniform, so that Apple, with the WIDE pipe and huge caches and buffers, is simply processing a lot more instructions at one time than the others.

I’m sure they use something that could be called “micro ops.” That doesn’t mean they use microcode, is my point. But a micro op could just be, for example, a fixed 82-bit field that expands on the 64 bits of the normal instruction by adding convenience bits like “this instruction needs the ALU” or “this instruction needs the floating point unit” or whatever. These are things that could be done later in the pipeline, but sometimes knowing these things early on makes it easier to schedule the instructions, provide hints to the branch prediction unit, etc.
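As a hedged illustration of that idea (the field names and the little opcode table below are invented, not Apple's actual format), a micro op can be thought of as the instruction plus a few pre-decoded convenience fields, rather than a sequence of microcode steps:

from dataclasses import dataclass

FLOAT_OPS = {"FADD", "FMUL"}      # hypothetical opcode classes for the sketch
BRANCH_OPS = {"B", "BL", "CBZ"}

@dataclass
class MicroOp:
    raw: str           # the original ISA instruction text
    needs_alu: bool    # hint: route to an integer ALU
    needs_fpu: bool    # hint: route to a floating point unit
    is_branch: bool    # hint: of interest to the branch prediction unit

def expand(instr: str) -> MicroOp:
    # Work out once, at decode, the things later pipe stages would otherwise
    # each have to re-derive from the raw opcode bits.
    opcode = instr.split()[0]
    return MicroOp(
        raw=instr,
        needs_alu=opcode not in FLOAT_OPS and opcode not in BRANCH_OPS,
        needs_fpu=opcode in FLOAT_OPS,
        is_branch=opcode in BRANCH_OPS,
    )

for text in ["ADD R0, R1 -> R2", "FMUL F0, F1 -> F2", "CBZ R2, loop"]:
    print(expand(text))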
 
SoC design? I wish I would have figured that out - but my point was rather about Apple placing everything in one package, including specialized processing units.

Geez people, do read what people are saying instead of just dog piling on.

We are reading what you are posting, but you are not very clear. You have been beating the drum that RISC is not better than CISC and pointing to past history. The point I was making, and which you clearly missed, was that Apple's success with their custom chips comes from their approach to implementing the ARM standard and from what they have packed onto their SoC, which includes other customized processors that together provide more upside performance with lower power draw than the x86 alternative.

I clearly stated that the ability to ramp up clock speeds while shrinking manufacturing processes covered up some of the inherent weaknesses in the CISC/x86 architecture. Now that those options are winding down - Apple has found a way to leverage the strengths of RISC/ARM in this current environment to some effect.
 
I’m sure they use something that could be called “micro ops.” That doesn’t mean they use microcode, is my point. But a micro op could just be, for example, a fixed 82-bit field that expands on the 64 bits of the normal instruction by adding convenience bits like “this instruction needs the ALU” or “this instruction needs the floating point unit” or whatever. These are things that could be done later in the pipeline, but sometimes knowing these things early on makes it easier to schedule the instructions, provide hints to the branch prediction unit, etc.
Thank you sir!

Overall the article is good. I liked how he not only mentions the use of specialized processor blocks but also does other things like properly explain what the Unified Memory Architecture is and how it differs not only from typical PC arrangements but also from shared-memory setups like the typical iGPU. And finally he makes it very clear that Apple's Firestorm is a VERY different animal from any other ARM microarchitecture and also differs from the typical RISC one too.
 
With Macs stubbornly stuck at 8% market share for the last two decades - Intel has nothing to worry about. And AMD has never been and never will end up in a Mac anyway. People are pretty much set in their ways: Camp Windows or Camp Mac.
I'm far from a business expert, but I have to imagine Intel's bottom line will certainly be affected as time moves on, with Apple using their own CPUs and not Intel's.
 