So I'm not a chip designer and my main parallel programming experience is with GPUs rather than CPUs, but I'd like to think that I know some basics about basic technical definitions.

This article from wccftech (which, yeah, is not the most reliable of websites to begin with) seems to be beyond clickbait - even beyond "fundamentally flawed", to use its own language:

https://wccftech.com/why-apple-m1-single-core-comparisons-are-fundamentally-flawed-with-benchmarks/

The author appears to be confusing logical cores with physical cores in order to claim that single-thread (single [logical] core) benchmarks are not comparable between physical cores with SMT2/HT enabled and physical cores like Apple's M1 Firestorm, which run a single thread only. And he is accusing Apple of being deceitful, despite the fact that Apple is using the same definitions we've all used for decades now when it claims to have a fast single core, and the fastest of anything near its power usage. My understanding is that benchmarking suites often call single-threaded tests "single core" tests because they run on one logical core. Maybe we should all try to use the term "single thread" instead to make it more precise and avoid this kind of confusion, but given that this is supposedly a tech writer, this may be an example of bad faith rather than genuine confusion.
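As a quick illustration of the logical-vs-physical distinction, here's a sketch you can run yourself; it assumes the third-party psutil package is installed and isn't anything from the article:

```python
# Logical vs. physical core counts on the machine running this script.
# Assumes psutil is installed (pip install psutil).
import os
import psutil

print("Logical cores (what 'single core' benchmarks schedule onto):", os.cpu_count())
print("Physical cores:", psutil.cpu_count(logical=False))
# On an SMT2/HT x86 chip these differ by 2x; on M1 (no SMT) both report 8.
```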

I'd like to think I'm not off base with the above, and I don't think I am, as apparently the AnandTech writers were trying to ... correct his misapprehension of how CPUs work on Twitter, but he blocked (at least one of) them.
 
Even if we accepted that comparing a single-thread score from a hyper-threaded CPU core against one from a single-threaded core were "unfair", the fact remains that real-world usage tests show the M1 to be faster for many tasks than comparable AMD/Intel CPUs. Application and OS responsiveness is also excellent, which implies that single-threaded performance is good (although without knowledge of the threading model of a specific application this is hard to quantify).
 
As for SLC cache, that’s for flash, not for the memory subsystem.

This is from a while back, but I thought I'd mention that I believe it actually is a reference to the memory subsystem, not flash. As you well know, many high-performance SoCs like M1 have a giant last-level cache (LLC) shared by all agents which access DRAM. (M1's is 16MB.) Some people have begun referring to whole-SoC LLCs as "System-Level Cache", or SLC. So now we've got an acronym overload on SLC...

That 20W-25W figure was for the entire M1 SoC, including CPU, GPU, RAM, and everything else. These figures came from the power-draw increase measured at the AC plug.

Not sure why they didn't run powermetrics to get more accurate figures? But several "reviewers" had it running... it showed the CPU maxed out drew ~13W, and the GPU maxed out drew ~6W.

I just used powermetrics on my M1 Air while running a very simple all-cores load (eight copies of 'yes >/dev/null'). Initial whole-package power draw was around 23W, with all cores locked at max frequency (2064 MHz for the E-Cluster, 3204 MHz for the P-Cluster). The E-Cluster was using ~1350mW, the P-Cluster ~21300mW.

That lasted about 20 seconds, then the system began ramping P-Cluster clocks down to reduce power. I'm now at about 12 minutes into the test and the E-Cluster is still at 2064 MHz and ~1350mW, while the P-Cluster is ~2210 MHz and ~9850mW.

"yes" isn't a very good load. It puts little stress on DRAM and isn't even a good workout of the CPU (no FP math). But it can generate a decent lower bound on how much power M1 CPUs can use, and that seems to be over 5W per P core (including its share of overhead in the P-Cluster, which includes a large L2 cache shared by all P-Cluster cores).

Since I've seen reports that the GPU has been observed (via powermetrics) at ~10W, the whole M1 should be able to draw at least 35W.
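In case anyone wants to roughly reproduce this, here's a sketch of the setup; the powermetrics flags are from memory, so double-check `man powermetrics`, and it needs sudo:

```python
# Rough sketch of the experiment above: spin up eight busy processes and then
# sample package power once with powermetrics (macOS only, requires sudo).
import subprocess, time

loads = [subprocess.Popen(["yes"], stdout=subprocess.DEVNULL) for _ in range(8)]
try:
    time.sleep(5)  # let the clocks ramp up before sampling
    subprocess.run(["sudo", "powermetrics", "--samplers", "cpu_power",
                    "-i", "1000", "-n", "1"])
finally:
    for p in loads:
        p.terminate()
```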

That efficiency will eventually enable a level of performance that neither AMD nor Intel will be able to match. Take the M1 running at 5 GHz: that's still a CPU with a ~25W TDP. That's a sustainable TDP, no "turbo" mode needed, not even in a laptop.

I agree that the revolution is how power efficient M1 is, but don't have unrealistic ideas about it scaling up by adding power. M1 (and anything built around the A14/M1 CPU cores) likely will never run as fast as 5 GHz, or even 4 GHz. When you design chips with great attention paid to power efficiency, as Apple does, there's not going to be enough timing slack to increase clocks substantially over stock without resorting to extreme overclocker stuff like high voltages, low temperatures, and so on. Apple won't want to do that, even in desktop Macs, so large increases over 3.2 GHz are unlikely. Instead, expect them to take advantage of bigger power budgets by adding more CPU and GPU cores.

The big win in bigger M series chips is going to be that, unlike Intel and even AMD, Apple won't have to restrict clock speeds much (or at all) based on how many cores are active. Intel chips have these flashy max clock speed numbers, but can never sustain those clocks if more than one or two cores are active, because one Intel core running at max frequency needs about 20W (or more in some cases, IIRC). Apple is matching or exceeding that single-thread performance at ~5W and 3.2 GHz, which means they can run four cores at max ST performance for the same power as one Intel core at max ST performance. That's going to translate into better multicore scaling, far less power to run most tasks, and so on.
 

I’ve never heard LLC referred to as SLC. I *have* heard it referred to as system cache (without “level”). If people are doing that, they should stop.
 
I have not been this excited about owning a new "Unix RISC workstation" since my SGI Indigo2 Max Impact.

The powerful RISC Unix workstations with exotic co-processors are back. :)
I fondly remember my O2, which I had at uni during my undergrad, postgrad, and PhD years. It was a pretty good LaTeX machine for writing papers while I was working, and for playing with OpenGL and such at other times. Yes, the good times are back!
 
Because I didn’t keep working on PowerPCs. Duh.
Joined this board just because of this discussion, it's a lot of fun, and I wanted to ask you specifically, cmaier (really is an incredible resume btw):

Why isn't the number of transistors the deciding factor in the M1?

Intel = 8 billion, AMD = 10 billion, Apple (5nm) = 16 billion.

When AMD gets their 5nm, and if Apple does not get 4 or 3 at the same time, how do you think the two will compare?

Two-part question, here's part 2:

I do believe AMD is more hybrid than CISC, and Arm is more hybrid than RISC, true?

Thanks in advance, cmaier
 
Well, more transistors doesn’t do you any good if you don’t put them to good use, right?

If you think about single-threaded operations, there are only so many transistors you can put to good use. Adding more functional units (adders, multipliers, shifters, etc.) has diminishing returns after 4. Beyond 8 it really is not likely to be worth it. Adding more cache is nice, but you can only add so much before you get diminishing returns (both because you aren't using it all, and because the more you have, the more distant some of it will be from the core, meaning the slower it will be). Same with the branch prediction unit, etc.

So, if you look into it, most ALUs (the part of the CPU that does integer math) use about the same number of transistors, within the same order of magnitude (I'm speaking of this in a qualitative, not quantitative, sense).

For parallel operations you can add more cores. That, too, has diminishing returns. 8 is probably more than enough for 95% of uses, and the benefit falls off rapidly each time you double the number of cores.
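To put rough numbers on that diminishing return, here's the usual Amdahl's law arithmetic as a quick sketch; the 90%-parallel figure is just an illustrative assumption, not a measurement of any real workload:

```python
# Amdahl's law for a job that is 90% parallelizable (p = 0.9 is illustrative only).
def speedup(n, p=0.9):
    return 1 / ((1 - p) + p / n)

for n in (2, 4, 8, 16, 32):
    print(n, "cores ->", round(speedup(n), 1), "x")
# 2 -> 1.8x, 4 -> 3.1x, 8 -> 4.7x, 16 -> 6.4x, 32 -> 7.8x
```

Each doubling of core count buys less than the one before it, which is why throwing cores at a single task stops paying off fairly quickly.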

So what Apple did with the transistors is use a lot of them for other types of computation - graphics, machine learning, etc. These are all good uses, but they don't explain why M1 destroys Intel even for traditional CPU functions that make no use of all this other circuitry.

On the same process, I would expect AMD to lag Apple by around 10-20% in performance/power, because of the CISC penalty. Of course AMD may run faster, but that’s only because they choose to burn a lot more power to do it, for example. That’s why we think in terms of performance/watt. AMD probably uses a similar design methodology to Apple, so I would think any differences in performance/watt would be a pretty good indication of the benefits of Arm vs. x86-64.

As for AMD, having designed many processors for them, I wouldn’t describe them as hybrid. They are CISC. CISC and RISC mean different things to different people, and can mean different things depending on the context, but there are two things that any CPU designer would likely agree scream “this is CISC!” And AMD and x86 have both of them:

1) to decode the instructions, you need a state machine. In other words, you have to take multiple clock cycles to decode the instruction, because you have to make multiple passes over the instruction stream to figure out what is going on, convert the instructions to different kinds of instructions, detect dependencies, etc. (There's a toy sketch of this after point 2.)

2) any arbitrary instruction can access memory. In RISC, only LOAD and STORE instructions (and maybe one or two others depending on the architecture) can access memory. This greatly simplifies what is necessary for dealing with register mapping, inter-instruction dependencies for scheduling, etc.
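Here's a toy sketch of why point 1 is such a big deal; the byte encodings below are made up for illustration and are not real x86 or Arm formats:

```python
# Toy illustration of finding instruction boundaries.
# The encodings are invented; only the structural difference matters.

def variable_length_boundaries(stream):
    """CISC-style: the first byte implies the instruction's total length,
    so boundary-finding is inherently serial."""
    LENGTHS = {0x01: 1, 0x02: 3, 0x03: 6}  # hypothetical opcode -> length map
    offsets, i = [], 0
    while i < len(stream):
        offsets.append(i)
        i += LENGTHS[stream[i]]   # must decode instr i before locating instr i+1
    return offsets

def fixed_width_boundaries(stream, width=4):
    """RISC-style: every instruction is `width` bytes, so all boundaries are
    known up front and many decoders can work on them in parallel."""
    return list(range(0, len(stream), width))

cisc = bytes([0x02, 0, 0, 0x01, 0x03, 0, 0, 0, 0, 0])
print(variable_length_boundaries(cisc))   # [0, 3, 4]
print(fixed_width_boundaries(bytes(12)))  # [0, 4, 8]
```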
 
Thank you, cmaier, thanks for responding so soon, and thanks for letting me pick your brain. Sorry, now that I've got you, a few more!

Not to worry, this is my last follow-up (but it's a doozy).

I knew there are diminishing returns on transistor count; for instance, an ant our size couldn't support its own weight, its apparent mythical strength wouldn't/couldn't scale.

And I did understand the added transistors are for other functions, but AMD is also SoC, and Intel mobile too. I'd be certain they'll find performance add-ons with the 60 percent more transistors, so I don't see that as special in the M1; the transistor count helps Apple, but it will also help the other houses.

I'd guess that's what you would do if AMD asked again, and I'd hope they have others thinking the same, so the diminishing transistor-count return is going to be offset with performance add-ons that use the transistors, just as on the Apple Arm.

For cache, I speculated a while ago that instead of adding cores, we'd have to go to actual separate processors, so we could, say, double the cache without diminishing returns by using two (or whatever number) separate multi-core CPUs to effectively increase cache?

For the number of cores, I'd suspect the actual diminishing return would be the number of processes running, so each process gets its own quantum whenever requested without even a hit to the scheduler; pretty much everything could run in real time, especially if there's enough RAM to load whatever you've got?

And a while back, when I read articles comparing CISC vs. RISC, there was a memory premium for RISC; obviously that's not the case in the Apple CPU, so was that a mistaken liability?

Why isn't there SMT in Apple's Arm cores?

And finally, I don't get the "4 performance and 4 efficiency core" thing: why not just individually clock down whichever cores you want? What's the deal with this?

Thanks, that's all I got. I understand if you don't want to dive into any or all of these, cmaier.
 
Many questions, and I might not understand them all, so feel free to follow up.

On the memory premium, the issue is instruction memory. Remember, when you are using a computer, some of the memory represents the instructions, and some of it represents the data. On a CISC machine, you can usually represent a given stream of instructions using fewer bits, so you use less memory for instructions. (I saw an academic paper that said x86 uses about 10% less than Arm, but it can vary quite a bit depending on the situation.) This is sort of a false problem, though. For almost any user, the amount of memory used by data will dwarf the amount of memory used for instructions. If memory had been cheap back in the days of the 8080 and Z80, we might never have had CISC, because there's really no point in trying to conserve instruction memory at this point. You do need to increase the size of the L0 instruction cache a little bit for RISC, but in the grand scheme of things it's a non-issue.

With respect to diminishing return on cores, it’s an issue both of threads and processes. In some cases multiple cores are useful because you are doing a lot of jobs at once. In other cases multiple cores are useful because you are doing a single job, but the job can be split into parallel pieces that can run simultaneously. In both cases there is diminishing return, but where that happens depends on the specific situation.

As for AMD, they are doing this “chiplet” thing, and they do some SoC work, but as far as I know they haven’t done much beyond graphics as the additional function. Doing more than that (machine learning, etc.) is a bit trickier for them, because they have to convince operating system vendors to use whatever extensions they dream up. And Intel will likely come up with different ones. This was always a problem for us at AMD, and it was a miracle we got Microsoft to go along with AMD64 (now x86-64).

As for SMT - I’ve said this before, but I’ll repeat. When I see that a processor has SMT, that tells me that it is a poor design, because it has too many resources that can’t be kept busy without it. The ideal design is to add cores for each thread you think you’ll need, and to design each core so that it doesn’t have extra resources that can’t be kept busy without SMT. SMT adds a lot of overhead, as well. And the SMT penalty would be larger for Arm, because Arm has more registers that need to be swapped to memory when the thread switches.

As for cache, you have multiple levels, and there are an infinite number of ways to organize them. You typically have a “system cache” that can be shared between the cores, and each core has its own level 1 cache. The level 2 cache could be a shared cache or can be individual. So many ways to do it. My Ph.D. project involved cache architecture, and I did thousands of simulations to figure out what would be the best way to do it based on the particular job we wanted to do with the processor.

As for the efficiency cores, running a task on one uses less power than running it on a clocked-down performance core at equivalent performance. They are also smaller, so you can typically fit 4 of them in the same space as 1 or 2 performance cores. For example, performance cores are designed to perform any instruction, including, say, integer multiplications. The integer multiplier is a very big circuit. It's not used very often. So you pull it out of the efficiency cores, and you send instruction streams to those cores if they don't do a lot of multiplies. If a rare multiply shows up, you can always simulate it using other operations (like shifts and adds). Even more so with integer division. Another example is that the high-performance core has many ALUs that can operate in parallel, and instructions can be issued out of order if they don't depend on each other. This means you have to have a scheduler, register renaming/reservation station units, etc. The efficiency cores can get rid of all that, and just run instructions through one at a time, in series. Every once in a while there will be a penalty because an instruction will have to wait for another instruction to finish instead of just going ahead, but you don't care about that for certain threads because they don't need to run very fast.
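If you're curious, the shift-and-add trick looks roughly like this; a minimal sketch of the general idea, not a claim about how Apple's cores actually implement it:

```python
# Shift-and-add multiplication: synthesizing a * b from adds and shifts
# (unsigned integers, purely illustrative).
def shift_add_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:          # if the low bit of the multiplier is set...
            result += a    # ...add the (shifted) multiplicand
        a <<= 1            # shift the multiplicand left
        b >>= 1            # move to the next bit of the multiplier
    return result

assert shift_add_mul(1234, 5678) == 1234 * 5678
```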
 
That answered everything but one question; I believe you understood everything I was thinking, all but one item.

And thanks for the invitation to follow up. The question I didn't see an answer to (maybe I missed it) was about multiple CPUs being able to get more cache per core vs. all cores on one CPU. For the reason you stated at the top, proximity is important for latency, so with more CPUs instead of more cores the proximity problem disappears, but I guess you still run into latency with communication between the CPUs?

I didn't know SMT has a price. I thought, why not just add SMT even if you have a proper design? I was thinking about the Ryzen 4700U, 8 cores with SMT turned off, vs. the 4800U, 8 cores and 16 threads with SMT turned on, so I thought SMT, while it might have a price, is a net positive?

And I didn't know the efficiency cores are smaller; that's the answer for a laptop, but not an answer for a desktop. Though as you say, clocking down isn't as efficient, and I hadn't known that either, so thanks there too.

Heading to bed, I'll see this tomorrow, so thanks for now!
 
If you think multiple CPUs are better because of cache, you can always just create two caches and put them on the same chip with two "CPUs." So I guess the answer is you can always play with giving multiple levels of cache, some shared by all cores, some shared by only certain cores, and get the same effect without paying the penalty of chip-to-chip communications.

As for SMT, it could be a net positive, but in my opinion the smarter design will always be more, simpler cores. The only reason to do SMT is if you can’t fit another core, or because you find that you often have ALUs sitting idle. But if you have ALUs sitting idle, that tends to show that maybe you have too many ALUs, or you have some bottleneck that is preventing them from being fed, and you should fix those problems instead.
 
Another example is that the high-performance core has many ALUs that can operate in parallel, and instructions can be issued out of order if they don't depend on each other. This means you have to have a scheduler, register renaming/reservation station units, etc. The efficiency cores can get rid of all that, and just run instructions through one at a time, in series. Every once in a while there will be a penalty because an instruction will have to wait for another instruction to finish instead of just going ahead, but you don't care about that for certain threads because they don't need to run very fast.

This is of course optional. ;) Icestorm cores are out-of-order, and though they are big compared to in-order Arm A5x efficiency cores, they don't end up using that much more power and are far more performant. True, they do pay a price in silicon area (I believe the ratio is about 3-to-1, Icestorm to Firestorm, while A5x cores are tiny), but Apple obviously thinks it is worth it. While Firestorm is obviously the star of the show, I've seen more than a few people say that Icestorm cores are, in some ways, even more impressive given their performance and wattage.

The Tredmont/Goldmont Atom (efficiency) cores in Lakefield/Alder Lake are also out-of-order (with SMT too, I think!). They're said to have a unique decoder architecture with Skylake IPC in a much smaller size.

EDIT: Names are Tremont and Gracemont (Goldmont is older) and they don't have SMT.
 
Of course. The point is simply that the microarchitecture of efficiency cores is going to be simpler in various ways than that of performance cores; otherwise you could just downclock.

Wasn't aware that Lakefield efficiency cores have SMT. That's dumb.
 
I didn't know SMT has a price. I thought, why not just add SMT even if you have a proper design? I was thinking about the Ryzen 4700U, 8 cores with SMT turned off, vs. the 4800U, 8 cores and 16 threads with SMT turned on, so I thought SMT, while it might have a price, is a net positive?

And I didn't know the efficiency cores are smaller; that's the answer for a laptop, but not an answer for a desktop. Though as you say, clocking down isn't as efficient, and I hadn't known that either, so thanks there too.

Whether to use SMT2 is a difficult question and, as @cmaier said, the tradeoff can be different for different architectures. For current x86 designs I'd say it is a net positive: it only has a cost if you use it, since cores with SMT2 disabled or enabled get the same single-thread benchmark scores, and it doesn't use much if any extra power even when on. But enabling SMT2 and running additional threads only nets you an additional 20% performance at best at full saturation, so nowhere near double. Some Power cores (distant cousins of the PPC G3-G5) have SMT8 (Power8), with 8 threads per core, and Ian at AnandTech mentioned there are SMT64 cores with 64 threads per core! No idea how these behave. SMT also poses additional security concerns, however.
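Quick back-of-the-envelope with that ~20% figure (an assumed best case, not a measurement):

```python
# "16 threads" is nowhere near 16 cores' worth of throughput.
physical_cores = 8
smt_uplift = 1.20                      # assumed best-case throughput gain from SMT2
effective_cores = physical_cores * smt_uplift
print(effective_cores)                 # 9.6 "core-equivalents", not 16
```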

Even for desktops: not all desktops have great cooling, and using less electricity is always good for everyone. But more to the point, if you have a good OS/programming paradigm for your developers (like Apple does ... Windows/Linux, maybe; well, we're about to find out!) built on prioritizing threads and knowing when to migrate them to efficiency cores, then for a small silicon budget you can ensure your performance cores are untouched by whatever background housekeeping is going on. That can be useful.
 
I think I'm wrong. I tried to look it up again and the efficiency cores don't seem to have SMT. My mistake. Also I got the names wrong. Banner post from me. 🤪
 
Perfect, you've answered everything, thank you very much for the private tutoring cmaier!

The only reason to do SMT is if you can’t fit another core

Clearly a decent reason for SMT in a laptop, especially if you're transistor-limited on the SoC, and of course there's the R&D cost of the added core; there are also some nice marketing effects from "8 cores, 16 threads".

Such a good information source, thanks for this discussion, I'm good to go
 

Engheim believes that Intel and AMD are in a tough spot because of the limitations of the CISC instruction set and their business models that don't make it easy to create end-to-end chip solutions for PC manufacturers.
To be accurate, they are RISC with CISC interpreters for backwards compatibility. Intel introduced the internal RISC core with the Pentium Pro (P6).
 
Crazy dave, thanks for your content too, it's been a really good read, great stuff, and thanks to everyone on the thread for the discussion.
 
If there's that big a hit, I've wondered why backward compatibility isn't simply tossed from x86 and given optional emulation instead, even in the cloud, or possibly via a pre-installed app that's specifically a VM; it could be a dongle or on PCIe removable media, working without user involvement just by plugging in or tapping, so there's no performance hit unless it's accessed.
 

So many bits and pieces of Windows are 32-bit even on ARM. So many old apps continue to run natively or in compatibility mode. Because Windows has such a big user base, they don't have the freedom that Apple has to break with the past very quickly.
 
To be accurate, they are RISC with CISC interpreters for backwards compatibility. Intel introduced the internal RISC core with the Pentium Pro (P6).
I don't think that is technically accurate. It's more like Intel invented the moniker "internal" RISC for marketing reasons.
 

Yes and no. In the Pentium Pro, IA-32 instructions were translated into buffered RISC-like micro-operations, and they still are today. It wasn't just marketing. The micro-ops are essentially RISC, and the design outperformed fully RISC CPUs on a clock-for-clock basis. But it did lose out once DEC Alpha chips reached much higher clock speeds (200 MHz vs. 500 MHz). Then again, you could build a dual Pentium Pro system for less than a DEC.
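As a toy illustration of what that translation means, here's a sketch; the instruction text and micro-op names are invented for illustration and don't correspond to the real, implementation-specific uops a decoder actually emits:

```python
# Toy illustration of "cracking" a CISC memory-operand instruction
# into RISC-like micro-ops (made-up syntax).
def crack(instruction: str) -> list[str]:
    if instruction == "add [rbx], rax":        # CISC-style add with a memory operand
        return [
            "load  t0, [rbx]",                 # uop 1: read memory
            "add   t0, t0, rax",               # uop 2: register-only ALU op
            "store [rbx], t0",                 # uop 3: write the result back
        ]
    return [instruction]                       # register-only ops map 1:1

for uop in crack("add [rbx], rax"):
    print(uop)
```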
 