
Apple Knowledge Navigator

Original poster
As someone who isn't a developer, I've been intrigued by how developers need to update their Mac software for the various Apple Silicon chips.

Benchmarks have shown that software scales pretty well when going from M1 > Pro > Max, but when it comes to the Ultra, more work seems to be required on the developer's end. Software that has been optimised, such as Blender, has shown very good performance gains.

I was wondering if anyone in the know would be kind enough to explain what's involved in getting software optimised for M1 Ultra. Is it as simple as clicking some tickboxes before recompiling? Or does the developer need to put time into rewriting parts of the code?

Thanks! ;)
 
Perhaps naively, I assumed when the M1 Ultra was announced that anything which used multi cores/threads (or whatever the correct term is) would automatically 'just work' and take full advantage of the Ultra.

It does appear that this isn't the case.
 
Here is the documentation for porting apps to Apple silicon. It's certainly not a trivial checkbox task.


Thanks, I had no idea it was that taxing.

Add on the fact that Rosetta 2 is so good, and it’s easy to see why developers just don’t bother

If the Ultra poses a challenge to fully utilize, will the M2 "Extreme" be equally or even more under-utilized, pro-software-wise?
 
Note that getting apps to run on many cores is not a challenge specific to AS. You can see it in x86 versions of multi-core apps as well. For essentially all apps, as you increase core count, you will eventually get to the point of diminishing returns.

So I think the more interesting question might be: Consider those apps that have native ports to both x86 and AS, and are multi-core capable. Among these, which are the ones for which there is a significant difference between the two ports in how many cores they can run well on?
 
Thanks, I had no idea it was that taxing.

Looking at the docs, it's not that taxing.

Then again, I'm a macOS developer who started working on iOS, bringing across best practices 8 years ago, and I have been through the transitions to ARC and 64-bit, as well as seeing the struggles going from Swift 1 through 5. The changes to GCD and attributing tasks to efficiency or performance cores are straightforward. While audio app developers may have a bigger task ahead, the transition is straightforward.

If developers keep their code up to date, the transition will be much easier than for someone with legacy code that hasn't been updated for some time.
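
To make the "efficiency or performance cores" part concrete, here is roughly what that attribution looks like with GCD quality-of-service classes. This is only an illustrative sketch; the queue labels and the work inside the closures are made up:

Code:
import Dispatch

// Higher-QoS work is preferred on the performance cores; utility and
// background work is eligible for the efficiency cores. The scheduler
// makes the final placement decision; QoS is only a hint.
let renderQueue = DispatchQueue(label: "com.example.render",
                                qos: .userInitiated,
                                attributes: .concurrent)
let indexQueue = DispatchQueue(label: "com.example.indexing",
                               qos: .utility,
                               attributes: .concurrent)

renderQueue.async {
    // latency-sensitive work, likely to land on P-cores
}

indexQueue.async {
    // throughput/background work, allowed to run on E-cores
}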
 
So in simple terms, if an app is already compiled for AS and runs well on the multi-core of M1, Pro and Max, why does it require more work for Ultra?
 
So in simple terms, if an app is already compiled for AS and runs well on the multi-core of M1, Pro and Max, why does it require more work for Ultra?
Not every task and not every algorithm scales perfectly with more cores. This is especially true of CPU cores, more so than GPU cores. Apple recommends splitting your task into at least 3 times as many pieces as the total number of cores. With 20 cores it just might not be possible to do that.
 
Not every task and not every algorithm scales perfectly with more cores. This is especially true of CPU cores, more so than GPU cores. Apple recommends splitting your task into at least 3 times as many pieces as the total number of cores. With 20 cores it just might not be possible to do that.
That’s interesting, thank you!
 
That's not interesting but rather nonsense. With a scalable algorithm you typically create as many threads as you have HW threads. The OS scheduler will take care of everything else.
Apple disagrees with you.

Tuning your code’s performance for Apple silicon

Manage Parallel-Computation Tasks Efficiently

One way to improve performance is to divide a problem into multiple pieces and execute those pieces in parallel on the available cores. Use this approach for large tasks that require significant processing resources. For example, use it to divide an image into multiple pieces and process those pieces in parallel.

In GCD, the concurrentPerform(iterations:execute:) function takes a provided block and calls it multiple times on the available system cores. This function uses a work-stealing algorithm to keep each core busy with work and is an efficient way to process large tasks in parallel. On Apple silicon, the algorithm distributes work efficiently to both p-cores and e-cores, adjusting the distribution of tasks dynamically as needed. To ensure the maximum benefit of this algorithm, make the number of iterations at least three times the total number of cores on the system. The system needs enough iterations to ensure appropriate distribution of the tasks across different types of cores.
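
To make that concrete, here's a minimal sketch of the pattern the docs describe, using roughly 3x as many chunks as cores. The buffer and the per-element work are invented for illustration:

Code:
import Foundation

// Split a large buffer into ~3x as many chunks as there are cores so the
// work-stealing scheduler can keep both P-cores and E-cores busy.
var pixels = [Float](repeating: 0, count: 4_000_000)
let coreCount = ProcessInfo.processInfo.activeProcessorCount
let chunkCount = coreCount * 3
let chunkSize = (pixels.count + chunkCount - 1) / chunkCount

pixels.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
        let start = chunk * chunkSize
        let end = min(start + chunkSize, buffer.count)
        guard start < end else { return }
        for i in start..<end {
            buffer[i] = buffer[i] * 0.5 + 1.0   // stand-in for real per-element work
        }
    }
}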
 
So in simple terms, if an app is already compiled for AS and runs well on the multi-core of M1, Pro and Max, why does it require more work for Ultra?
There are three main areas: CPU, GPU and various accelerators, esp. for video. For each of those categories the Ultra has more.

CPU: Multi-core CPU scalability is a classic problem often described by "Amdahl's Law". A tiny fraction of serial (non-parallelizable) code will rapidly limit multi-core scalability. You might think that an algorithm which is 75% parallel is good. In fact that means going from 10 cores to 20 cores will yield limited improvement. That is not unique to Apple Silicon but applies to any multi-core CPU. So the Ultra (or any high-core-count CPU) requires more work from the programmer to effectively harness the available cores: https://en.wikipedia.org/wiki/Amdahl's_law
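
As a back-of-the-envelope illustration of that 75% example (my arithmetic, not figures from the benchmarks above):

Code:
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
// where p is the parallel fraction and n is the number of cores.
func amdahlSpeedup(parallelFraction p: Double, cores n: Double) -> Double {
    1.0 / ((1.0 - p) + p / n)
}

let p = 0.75
print(amdahlSpeedup(parallelFraction: p, cores: 10))        // ~3.08x
print(amdahlSpeedup(parallelFraction: p, cores: 20))        // ~3.48x
print(amdahlSpeedup(parallelFraction: p, cores: .infinity)) // 4.0x ceiling

Doubling the cores from 10 to 20 buys roughly 13% more throughput, and no core count ever gets past 4x.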

GPU: In theory the GPU should be the most scalable, since the hardware and software were designed from the outset for parallel use. However, as GPU core counts increase, more attention is needed to obtain the best scalability. This was discussed in the WWDC22 talk 10159, "Scale compute workloads across Apple GPUs". Areas that might need attention are serialization between CPU and GPU, improving memory locality, and monitoring new Xcode performance counters such as MMU TLB Miss Rate: https://developer.apple.com/videos/play/wwdc2022/10159/

Video accelerators: The Ultra has twice as many video accelerators as the Max: 2 video decode engines, 4 video encode engines and 4 ProRes encode/decode engines. For the programmer to harness that full capability in parallel would require significant work, and for some codecs it might not be possible.

Right now Apple Compressor can automatically segment a ProRes input file to separate fragments and use multiple processes (each theoretically using a separate accelerator), then it concatenates the segments for final output. However there is currently limited scalability from this. It is not bottlenecked on CPU, GPU or I/O, so there is some internal limit.

For Long GOP codecs, in theory those using independent or closed GOPs (which don't reference other GOPs) could possibly be segmented and likewise processed in parallel. The accelerator hardware would not know the difference. However, any time you split an interframe codec it is risky. That is essentially equivalent to utilities that do "no encode" trimming, which historically has been problematic for Long GOP formats.

Already it appears the multiple decoders on the M1 Ultra are being used for multi-stream input, e.g., multicam. Unlike the single-file case, decoding multiple input files in parallel avoids the Long GOP segmentation risk. FCP is significantly smoother on M1 Ultra than M1 Max when reading four-camera multicam composed of 4K H.264. Neither of those cases is limited by CPU, GPU or I/O, so that implies the additional video decoders are being used.

It's a complex area of finding bottlenecks, examining what causes each one then assessing what is needed to improve it. With the end of Dennard Scaling around 2006 (https://en.wikipedia.org/wiki/Dennard_scaling), each CPU core cannot keep getting faster from clock rate. That leaves more incremental per-core improvements from IPC, but mainly programmers must learn to use increased parallelism and specific hardware accelerators. Software methods that scaled OK when going from 4 to 8 CPU cores may require more stringent study to go beyond that. Likewise prior methods which worked OK for a lower-end GPU may not scale to larger ones, so the programmer must study and optimize this.

Traditionally there has been only a single video accelerator, shared among all CPU cores. E.g., for most Intel CPUs there is only a single Quick Sync accelerator that's shared among all cores. That implies a simple programming model where all threads serialize waiting on Quick Sync. If, after years of programmers doing it that way, multiple accelerators become available, a lot more thought is required to use them in parallel without concurrency bugs.
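
As a rough illustration of that last point, a counting semaphore is one simple way to model "N identical engine slots" instead of the old "everyone waits on the one Quick Sync" model. The slot count, segment names and encodeSegment function here are all hypothetical:

Code:
import Dispatch

func encodeSegment(_ name: String) {
    // placeholder: submit the segment to the real encode pipeline here
    print("encoding \(name)")
}

// With a single shared engine this semaphore would have value 1 and all
// jobs serialize; with, say, 4 ProRes engines, up to 4 encode jobs can
// be in flight at once.
let engineSlots = DispatchSemaphore(value: 4)
let segments = ["seg0.mov", "seg1.mov", "seg2.mov", "seg3.mov", "seg4.mov", "seg5.mov"]

DispatchQueue.concurrentPerform(iterations: segments.count) { i in
    engineSlots.wait()              // claim an engine slot
    defer { engineSlots.signal() }  // release it when this job is done
    encodeSegment(segments[i])
}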
 
Thank you very much for the reply, it's greatly appreciated.

Is the fact that the Ultra relies on fusing two SoCs together a bottleneck in development, or does it not impact the developer beyond simply the increased number of cores?
 
Not every task and not every algorithm scales perfectly with more cores. This is especially true of CPU cores, more so than GPU cores. Apple recommends splitting your task into at least 3 times as many pieces as the total number of cores. With 20 cores it just might not be possible to do that.
Which makes it somewhat clear that simply adding more cores may not be a good use of money or effort, since the returns are rapidly diminishing.
 
Thank you very much for the reply, it's greatly appreciated.

Is the fact that the Ultra relies on fusing two SoCs together a bottleneck in development, or does it not impact the developer beyond simply the increased number of cores?
It might bottleneck the CPU and/or GPU cores but it shouldn't affect the concurrency model that the programmer develops against. Apple recommends using Grand Central Dispatch (GCD) to get the most balanced use of CPU cores (as I noted above). That library is optimized for Apple silicon and the M1 Ultra.

As you can see from @Gerdi's response above, not all developers are going to know to use GCD and might try to create their own thread pools. That should still work, but it will take a lot more effort on the developer's part and probably would need to be optimized for each SoC (at least somewhat). Developers should be lazy and just use Apple's GCD concurrency solution, and the Ultra will probably scale just fine without additional developer work.
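
A tiny sketch of what "be lazy and let GCD do it" looks like in practice, as opposed to spawning your own threads. The queue label and processJob are invented:

Code:
import Dispatch

func processJob(_ id: Int) {
    // placeholder: one independent unit of work
}

// Instead of creating one Thread per core and hand-balancing work,
// hand the jobs to a concurrent queue and let the system decide how
// many threads to run and on which core type.
let queue = DispatchQueue(label: "com.example.jobs",
                          qos: .userInitiated,
                          attributes: .concurrent)
let group = DispatchGroup()

for job in 0..<64 {
    queue.async(group: group) {
        processJob(job)
    }
}
group.wait()   // block until every job has finished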
 
As you can see from @Gerdi's response above, not all developers are going to know to use GCD and might try to create their own thread pools. That should still work, but it will take a lot more effort on the developer's part and probably would need to be optimized for each SoC (at least somewhat). Developers should be lazy and just use Apple's GCD concurrency solution, and the Ultra will probably scale just fine without additional developer work.
I think it's the opposite. The custom thread pool (or another platform-independent thread pool) probably already exists, because software is cross-platform by default. You generally want to avoid writing platform-specific code, because it usually means a lot of work for minimal gains. Especially if the platform-specific approach uses different logic than your existing platform-independent solution, forcing you to write an imperfect translation layer.
 
I think it's the opposite. The custom thread pool (or another platform-independent thread pool) probably already exists, because software is cross-platform by default. You generally want to avoid writing platform-specific code, because it usually means a lot of work for minimal gains. Especially if the platform-specific approach uses different logic than your existing platform-independent solution, forcing you to write an imperfect translation layer.
But then you lose all of Apple's optimizations. You are describing a fundamental problem with cross-platform code, and a likely reason a lot of software running on the M1 Ultra seems to be unoptimized.

It’s going to be a problem with Intel Alder Lake going forward for the same reason. Optimizing for asymmetric CPUs isn’t something most developers have any experience with.
 
I'm actually working on a particle system that I'm running on my M1 Max right now, which uses the concurrentPerform function you guys are arguing about :D

So a little experiment of how long it takes a particle simulation with a million particles to update 200 times while varying the iteration count:

1 iteration - 11.361510995 seconds
6 iterations - 7.644108836 seconds
7 iterations - 3.45511783 seconds
8 iterations - 3.099464963 seconds
9 iterations - 4.392617005 seconds
10 iterations - 4.010208301 seconds
20 iterations - 4.346037659 seconds
30 iterations - 4.226957578 seconds
50 iterations - 4.227287169 seconds
80 iterations - 4.167488548 seconds
As you can see, in this particular instance Gerdi is correct and using exactly 8 iterations (the same as the number of performance cores in my M1 Max) is the fastest.

That said, this is only because the tasks I'm performing are almost identically load-balanced (if I have 10 iterations, that will be exactly 100,000 particles per thread, and the "simulation" treats every particle identically, so all tasks will probably finish at the same time). If the tasks weren't load-balanced then concurrentPerform's work-stealing algorithm would kick in and you'd need more tasks in order for them to be stolen by idle cores, and 3x the number of cores is probably a good starting value, so jdb8167 is correct there.

As for the OP's question I don't really think that writing software for the Ultra is that much different for 99% of cases. I'm sure on the Ultra the "sweet spot" would be 16 iterations.
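
For anyone curious, the benchmark loop is roughly shaped like the sketch below. The particle update is a stand-in and the timing uses DispatchTime; it's an illustration of the pattern, not the actual code behind those numbers:

Code:
import Foundation

struct Particle { var x, y, vx, vy: Float }

var particles = [Particle](repeating: Particle(x: 0, y: 0, vx: 1, vy: 1),
                           count: 1_000_000)

// Run `frames` updates of the whole particle array, split into
// `iterations` chunks handed to concurrentPerform; return seconds taken.
func runUpdates(iterations: Int, frames: Int) -> Double {
    let chunkSize = (particles.count + iterations - 1) / iterations
    let start = DispatchTime.now()
    for _ in 0..<frames {
        particles.withUnsafeMutableBufferPointer { buffer in
            DispatchQueue.concurrentPerform(iterations: iterations) { chunk in
                let lo = chunk * chunkSize
                let hi = min(lo + chunkSize, buffer.count)
                guard lo < hi else { return }
                for i in lo..<hi {
                    buffer[i].x += buffer[i].vx   // stand-in for the real update
                    buffer[i].y += buffer[i].vy
                }
            }
        }
    }
    return Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
}

for n in [1, 6, 7, 8, 9, 10, 20, 30, 50, 80] {
    print("\(n) iterations - \(runUpdates(iterations: n, frames: 200)) seconds")
}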
 
Apple certainly does not disagree. I just mixed tasks with threads - yet my statement about threads is correct. The fact that you need more tasks than threads is trivial.
That kind of matters when it comes to Apple's recommendations. You can implement your own thread pools and implement a work-stealing algorithm to try to keep the performance cores working but that isn't what Apple recommends for most applications.
On a Mac with Apple silicon, GCD takes into account the differences in core types, distributing tasks to the appropriate core type to get the needed performance or efficiency.

For most apps, GCD offers the best solution for executing tasks. Instead of custom thread pools, you use dispatch queues to schedule your tasks for execution. Dispatch queues support the execution of tasks either serially or concurrently.
So it isn't simply a case of creating threads equal in number to cores and letting the scheduler handle the rest. I'm sure the scheduler does the best job it can with the information it has but why wouldn't you want to use the solution optimized for Apple silicon? One reason is as @JouniS says above, because you have cross-platform code that you don't want to rewrite. The problem with that is it isn't likely your cross-platform thread pool is designed for asymmetrical cores.
If you implement parallel computations using your own thread pools, use your own work-stealing algorithm to distribute tasks dynamically. If you use a static distribution, threads running on P cores will finish much sooner than threads running on E cores. As with the GCD function, make the number of tasks greater than the number of cores to ensure a fair distribution of work.
Any or all of these problems might be the reason why an M1 Ultra doesn't scale as well as expected. I'm not sure why you claimed my explanation was nonsense since it came from Apple almost directly.
 
I'm actually working on a particle system that I'm running on my M1 Max right now, which uses the concurrentPerform function you guys are arguing about :D
It would be interesting to see what would happen on an M2 since the efficiency cores are quite a bit faster than on the M1 family. It's possible that using all 4 efficiency cores would speed up your simulation quite a bit over just using 4 performance cores.
 
Thank you very much for the reply, it's greatly appreciated.

Is the fact that the Ultra relies on fusing two SoCs together a bottleneck in development, or does it not impact the developer beyond simply the increased number of cores?
For the most part I don't think the two fused SoCs have a major impact on software development, as compared to the same subsystems on a single die.

Re the discussions about Grand Central Dispatch, I believe the future direction is using Swift Concurrency which may avoid the "thread explosion" problem. This was discussed in the WWDC21 talk "Swift concurrency: Behind the scenes": https://developer.apple.com/videos/play/wwdc2021/10254/
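
For what it's worth, a minimal sketch of the structured-concurrency version of the same "split the work into chunks" idea, using a task group; renderTile and the tile count are invented:

Code:
// Swift Concurrency schedules child tasks on a cooperative pool sized to
// the core count, so awaiting work does not keep spawning threads the way
// a blocked GCD worker can.
func renderTile(_ index: Int) async -> Int {
    // placeholder for a real unit of work
    return index * index
}

func renderAllTiles(count: Int) async -> [Int] {
    await withTaskGroup(of: (Int, Int).self, returning: [Int].self) { group in
        for i in 0..<count {
            group.addTask {
                let value = await renderTile(i)
                return (i, value)
            }
        }
        var results = [Int](repeating: 0, count: count)
        for await (index, value) in group {
            results[index] = value
        }
        return results
    }
}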

None of this is unique to the M1 Ultra, but the more cores and the more accelerators there are, the more concurrent programming is needed -- whether the same number of cores and accelerators are distributed across one die or multiple dies. That in turn requires more threads, and more coordination to avoid serialization. There is always a risk that one thread could get stuck on a pending operation, and if GCD automatically spins up another thread, that could lead to excessive threads, each of which has overhead.

Re the talk about the Ultra not having "the expected scalability": expected by whom? Anyone familiar with the underlying issues for the past 40 years knows that scalability is difficult across many processors. It's easy to get good scalability when going from 1 to 2 CPUs, and difficult at higher numbers. As supercomputer pioneer Seymour Cray said decades ago: "If you were plowing a field, would you rather use two strong oxen or 1024 chickens?" Unfortunately there is a limit to how strong a single CPU core can be, so programmers must increasingly become "chicken wranglers".
 