
thedocbwarren

macrumors 6502
Original poster
Nov 10, 2017
430
378
San Francisco, CA
I've been super fascinated by how Apple is managing the efficiency cores. I've thrown some huge multi-core workloads at this thing and seen all 8 cores light up and happily blow through the task (very desktop-like performance). Mostly I see some feathering here and there on the first 4 (assuming these are the high-performance ones). Sometimes I'll see activity on the last 4. Interestingly, the Parallels Tech Preview runs all of its VM activity on the efficiency cores. I'm trying to get some idea of what it classifies as efficiency work and what not. Also, I get the sense these cores can scale up as well and operate at high power when needed.

Anyone with more background on what's going on behind the scenes? I'm somewhat familiar with Arm, but normally I've seen them choose one cluster over the other to run IoT devices, e.g. Tegra using only 4 cores and ignoring the low-power cores.
 
I don’t have a ton of background here, since Apple’s documentation on this isn’t hugely in depth, but my understanding is that QoS levels, along with some internal logic, help Apple decide which core a thread gets assigned to. So if I assign QoS levels associated with background tasks to certain queues, they are much more likely to get assigned to the efficiency cores.


One advantage Apple has here is that both sets of cores run the exact same version of the ARM ISA, so a thread can be moved between them. This gives flexibility: if what was thought to be a lightweight UI thread starts doing heavy work, it can be moved to a power core if needed. I’ve read about issues where Samsung SoCs used slightly different ISA versions between the two clusters, so some threads would crash when moved to the efficiency cores. I think they have since fixed it.
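To make the QoS point concrete, here's a minimal sketch of how a third-party app hands the scheduler that hint through Grand Central Dispatch (the queue labels and workloads are made up for illustration):

```swift
import Dispatch

// The QoS class on a queue is the main scheduler hint a third-party app has.
// .userInteractive/.userInitiated work tends to land on the performance cores;
// .utility/.background work is much more likely to run on the efficiency cores.
let renderQueue = DispatchQueue(label: "com.example.render", qos: .userInitiated)
let indexQueue = DispatchQueue(label: "com.example.indexer", qos: .background)

renderQueue.async {
    // Latency-sensitive work: eligible for the P-cores.
}

indexQueue.async {
    // Long-running housekeeping: steered toward the E-cores.
}
```

Note that the QoS class is only a hint; as discussed above, the kernel can still move a thread to the other cluster if the workload turns out to be heavier or lighter than the class implies.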
 
  • Like
Reactions: CheesePuff
I don’t have a ton of background here, since Apple’s documentation on this isn’t hugely in depth, but my understanding is that QoS levels, along with some internal logic, help Apple decide which core a thread gets assigned to. So if I assign QoS levels associated with background tasks to certain queues, they are much more likely to get assigned to the efficiency cores.

One advantage Apple has here is that both sets of cores run the exact same version of the ARM ISA, so a thread can be moved between them. This gives flexibility: if what was thought to be a lightweight UI thread starts doing heavy work, it can be moved to a power core if needed. I’ve read about issues where Samsung SoCs used slightly different ISA versions between the two clusters, so some threads would crash when moved to the efficiency cores. I think they have since fixed it.
This is very interesting. And you're right, I've seen other SoC devices use different ISAs for the split. One of my other devices was like this, and the way it was handled seemed buggy.
 
Most importantly, Apple controls the dispatch mechanism and can assign certain OS level functions and APIs to run on efficiency cores in certain situations.
 
I've been super fascinated by how Apple is managing the efficiency cores. I've thrown some huge multi-core workloads at this thing and seen all 8 cores light up and happily blow through the task (very desktop-like performance). Mostly I see some feathering here and there on the first 4 (assuming these are the high-performance ones). Sometimes I'll see activity on the last 4. Interestingly, the Parallels Tech Preview runs all of its VM activity on the efficiency cores. I'm trying to get some idea of what it classifies as efficiency work and what not. Also, I get the sense these cores can scale up as well and operate at high power when needed.

Anyone with more background on what's going on behind the scenes? I'm somewhat familiar with Arm, but normally I've seen them choose one cluster over the other to run IoT devices, e.g. Tegra using only 4 cores and ignoring the low-power cores.
The first 4 cores are the E-cluster; the last 4 make up the P-cluster. sudo powermetrics --samplers cpu_power provides a breakdown with power and frequency stats.
 
Most importantly, Apple controls the dispatch mechanism and can assign certain OS level functions and APIs to run on efficiency cores in certain situations.
For sure, but it's easier for Apple to use the same QoS mechanisms third-party devs have access to. Most of the background services should be using queues with utility/background-level QoS anyway, which pushes them toward the efficiency cores easily enough.

This is very interesting. And you are right I've seen some other SoC devices use different ISAs for the split. One of my other devices was like this and it was seemingly buggy how that was handled.

Yeah, it really pays to get the details right here. But if you do, the ability to assign any thread to any core does give you flexibility, at the cost of some complexity in the scheduler.

I kinda wish there were more detail floating around about the A14/M1 scheduler. Last time I worked on a team that did OS-level code like this, our approach was things like timer coalescing and minimizing "time to idle" to make the battery last longer. Now that asymmetric multiprocessing is here, it adds another lever for balancing responsiveness against power consumption. I'd love to see what has been learned about scheduling on these sorts of setups, especially whether there's a breaking point where it's better to wake a performance core to finish a task quickly and put the CPU back to sleep, instead of always using the efficiency cores when trying to save battery.
 
I'm trying to get some idea of what it classifies as efficiency work and what not. Also, I get the sense these cores can scale up as well and operate at high power when needed.

We don't really have that much information on how Apple schedules multicore work. Given that they have tons of experience with asymmetric designs, I'd wager they have the details pretty much worked out. They likely use a complex set of heuristics.

Things we know for certain (or that are obvious):

- Efficiency cores have a very different internal design than the performance cores: fewer execution units, smaller caches, and ultimately only a fraction of the available performance, even though they are still quite capable out-of-order designs compared to mainstream ARM chips

- There are some differences in physical capabilities: the efficiency cores lack the memory-ordering switch and thus cannot be used to run Rosetta 2-translated apps. It is also unclear whether the efficiency cores have AMX support

- If you are running multiple active (resource-intensive) threads, Apple will utilize all available cores. Performance cores get priority: work is scheduled on performance cores first and then on efficiency cores. So if you have only a few active threads, they will generally run on performance cores. Apple CPUs can quickly move threads between the two core types, so you get close to optimal performance in every case

- If your thread is marked as low priority, it will probably run on an efficiency core.
 
Going by Activity Monitor, the first 4 cores are the efficiency cores, I believe.
 
This is probably a silly question, but... :)

Is there some particular reason why you'd build, say, 4 high-performance and 4 low-performance cores, as opposed to 8 high-performance cores with the ability to ramp any or all of them up and down in speed as needed?
Would this be something to do with fitting it onto the silicon, or would such a concept simply not work?
Just wondering.
 
This is probably a silly question, but... :)

Is there some particular reason why you'd build, say, 4 high-performance and 4 low-performance cores, as opposed to 8 high-performance cores with the ability to ramp any or all of them up and down in speed as needed?
Would this be something to do with fitting it onto the silicon, or would such a concept simply not work?
Just wondering.
They aren’t low-performance cores; they are high-efficiency cores. In an ideal world it would be great if all eight cores were symmetrical and able to ramp between high performance and high efficiency. The reality is that there are design trade-offs that make that impractical. The efficiency cores use less silicon die area, for example.

Designing for low power usage is something that Arm cores are traditionally good at, which makes asymmetric cores a good trade-off. What Apple has done with the M1 gets the best of both performance and efficiency within a given power and thermal budget.
 
  • Like
Reactions: RPi-AS and Santiago
This is probably a silly question, but... :)

Is there some particular reason why you'd build, say, 4 high-performance and 4 low-performance cores, as opposed to 8 high-performance cores with the ability to ramp any or all of them up and down in speed as needed?
Would this be something to do with fitting it onto the silicon, or would such a concept simply not work?
Just wondering.

It’s as @jdb8167 says above: it’s very difficult to design a CPU core that excels at both high performance and high efficiency. Energy efficiency on x86 is currently achieved exactly the way you describe: the CPU is slowed down when high performance is not needed. Still, you’d have to slow a high-performance core down a lot to get to really low power usage, and even then there is some constant power draw you can’t eliminate.

Then there are processors designed for power efficiency, like the Intel Atom. They are often “simpler”, with fewer execution units, less cache, etc., so they can reach really low power consumption levels. The problem is that their design limits them from reaching high performance.

That’s why the modern paradigm is to use both: you have two types of cores in your CPU and shuffle execution around as appropriate. It’s a much more complex design, with tricky management problems, but the benefits are substantial.

By the way, Intel is also embracing this approach. They already have one asymmetric CPU (Lakefield, and it sucks), and they are going fully asymmetric with their next-gen CPU, Alder Lake.
 
I think I’ll just add that the best guess I’ve seen is that the efficiency cores are about 20-25% of the performance of the power cores, but at just 10% of the power consumption. Yes, they aren’t as fast, but they still are on the order of 2x the performance per watt of the power cores.
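The 2x figure follows directly from those two estimates; here's the arithmetic as a quick check (using the 22.5% midpoint of the 20-25% range, which is my assumption):

```swift
// Napkin math behind the perf/W claim, using the numbers from the post.
let eCorePerf = 0.225    // E-core throughput relative to a P-core (20-25% midpoint)
let eCorePower = 0.10    // E-core power relative to a P-core (~10%)
let perfPerWatt = eCorePerf / eCorePower
print(perfPerWatt)       // ~2.25x the performance per watt of a P-core
```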
 
I think I’ll just add that the best guess I’ve seen is that the efficiency cores are about 20-25% of the performance of the power cores, but at just 10% of the power consumption. Yes, they aren’t as fast, but they still are on the order of 2x the performance per watt of the power cores.
It would be neat to test that and see for sure. Even with 4 P-cores and 4 E-cores this design works super well.
 
  • Like
Reactions: phill85
It would be neat to test that and see for sure. Even with 4 P-cores and 4 E-cores this design works super well.

Yes, it would. Someone would need to create a benchmark that is aware of the asymmetric design, load all the cores down, and look at the difference between the 4 fastest and 4 slowest results. You can’t just load down one core, or you will very likely find your task on a power core.

The 10% power consumption number comes from Apple. I tried to put together an estimate based on the Geekbench numbers, but it’s quite possible my numbers are more of an “upper bound”. There’s a lot of uncertainty in my napkin math, sadly.
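A sketch of what such a benchmark could look like: run one copy of a fixed workload per hardware thread so both clusters fill up, then compare the per-thread times. The workload size and the use of `concurrentPerform` are my choices here, not a validated methodology:

```swift
import Foundation
import Dispatch

// Fixed, CPU-bound workload: under full load on an M1, the 4 slowest results
// should come from the efficiency cores.
func spin(_ iterations: Int) -> Double {
    var x = 1.0
    for i in 1...iterations { x += 1.0 / Double(i) }
    return x
}

let threads = ProcessInfo.processInfo.activeProcessorCount
var times = [Double](repeating: 0, count: threads)
let lock = NSLock()

// concurrentPerform saturates the available cores and blocks until
// every iteration has finished.
DispatchQueue.concurrentPerform(iterations: threads) { t in
    let start = Date()
    _ = spin(5_000_000)
    let elapsed = Date().timeIntervalSince(start)
    lock.lock()
    times[t] = elapsed
    lock.unlock()
}

print(times.sorted())  // fastest (P-core) to slowest (E-core) run times
```

In practice you'd want much longer runs and several repetitions to smooth out frequency ramping, but the shape of the result (a cluster of fast times and a cluster of slow ones) is the signal.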
 
It has been tested. Icestorm is about 20-30% of the performance of Firestorm but has 10 times lower power consumption, just as @Krevnik says.

I totally forgot about AnandTech, to be honest. I really need to add them to my normal reading list.

I wasn’t actually expecting my napkin math to be that close. I get lucky sometimes, I guess. :)
 
I totally forgot about AnandTech, to be honest. I really need to add them to my normal reading list.

I wasn’t actually expecting my napkin math to be that close. I get lucky sometimes, I guess. :)
Awesome! It's great to see. I've seen the whole system get utilized under load, and efficiency cores or not, they work very well. This design should scale nicely when we get 12-, 16-, etc. core systems this year or next. I'm very content with the 13 and looking at something in the desktop range for the next round. So far I'm enjoying helping port open-source software and getting it running natively.
 
I wasn’t actually expecting my napkin math to be that close, get lucky sometimes I guess. :)

Napkin math works more often than it doesn’t in my experience. It’s a natural way to marginalize over nuisance variables :)
 
I know this is an older thread, but Ars just posted a brief article on how QoS impacts scheduling on the M1, with links back to the original work. It looks like the background QoS level is sent exclusively to the efficiency cores, while the other QoS levels can be assigned to either set of cores.

Not a complete answer, but does provide confirmation that QoS is a part of it.

https://arstechnica.com/gadgets/202...-cpu-but-m1-macs-feel-even-faster-due-to-qos/
 
I don't expect this will be a very frequent pain point, but I'm setting up my new M1 Max machine, and right now the Xcode install has been going for nearly an hour with no end in sight. I'd like to figure out if there is a way to *override* the OS's thread scheduling so installd can use the P-cores when I really, really want that to happen, and restore the default behavior afterwards. It's highly likely that the huge amount of compute is layers and layers of crypto/hash checks on the countless files in the Xcode package (based only on anecdotal inspection done in the past, when I was particularly annoyed about installs chewing up half the battery on an Intel Mac; it's the only explanation for why saving files to disk could be so heavy). Ordinarily, running software installs only on E-cores is a great idea, but not when I'm anxiously trying to set up the computer so I can build the software I need to use it.

Anyone know if command line or other tools exist to allow user control over core affinity?

EDIT: I installed htop and used it to change installd's nice level to -20. At first I thought it did nothing, but it started to offload some activity to the P-cores!
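For anyone landing here later, the command-line equivalent of that htop trick would look something like this. Both commands need root, taskpolicy is macOS-only, and targeting installd is just this thread's specific case, so treat it as a sketch:

```shell
# Find installd and raise its scheduling priority, mirroring the htop approach.
pid=$(pgrep -x installd || true)
if [ -n "$pid" ]; then
    sudo renice -n -20 -p "$pid"    # strongest conventional priority boost
    sudo taskpolicy -B -p "$pid"    # clear the "background" policy, if set,
                                    # which otherwise steers work to the E-cores
fi
```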
 

Attachments

  • Screen Shot 2021-10-29 at 1.51.51 AM.png (1 MB)
Last edited:
  • Like
Reactions: TarkinDale