
leman

Apple has just released a framework for using Stable Diffusion models on Apple Silicon. This includes tools for converting the models to Core ML (Apple's ML framework) as well as some libraries to use these models. I haven't looked into this yet, but they claim some fairly impressive performance gains, along with a dramatic reduction in required RAM.

I hope someone will be quick to build an interface and a one-click installer for ease of use.

You can check it out here: https://machinelearning.apple.com/research/stable-diffusion-coreml-apple-silicon
 
Apple has chosen peculiar hardware for benchmarking. It usually chooses best-in-class hardware.
stable-diffusion-coreml.png

  • The image generation procedure follows the standard configuration: 50 inference steps, 512x512 output image resolution, 77 text token sequence length, classifier-free guidance (batch size of 2 for unet).
  • Weights and activations are in float16 precision for both the GPU and the ANE.
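In case the "batch size of 2 for unet" part is unclear: with classifier-free guidance the UNet is run on an unconditional input and a text-conditioned input at every step, and the two noise predictions are then blended. Below is a rough pure-Python sketch of that blend and of the sizes implied by the configuration above. The function name and the guidance scale of 7.5 are my own assumptions (7.5 is a commonly used default, not something stated in Apple's notes), and real pipelines do this on tensors rather than lists.

```python
def cfg_combine(noise_uncond, noise_text, guidance_scale=7.5):
    """Blend unconditional and text-conditioned noise predictions.

    guidance_scale=7.5 is an assumed, commonly used default.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


STEPS = 50       # inference steps, from the configuration above
CFG_BATCH = 2    # unconditional + text-conditioned input per step

# Number of individual UNet inputs processed for one image.
unet_inputs = STEPS * CFG_BATCH

# Stable Diffusion v1 works on a latent that is 8x smaller per side
# than the output image, with 4 channels.
latent_shape = (CFG_BATCH, 4, 512 // 8, 512 // 8)

print(unet_inputs, latent_shape)
print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0))
```

With a guidance scale of 1.0 the blend simply returns the text-conditioned prediction; larger values push the result further away from the unconditional one.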

It seems that the neural engine and the GPU don't generate the same image. I wonder which one is faster.
ne-gpu.png


 
Apple has chosen peculiar hardware for benchmarking. It usually chooses best-in-class hardware.

They have benchmarks for a wide range of hardware. Choosing low-end machines for that table is probably deliberate, to show that image generation is viable even on a base M1.
 
They have benchmarks for a wide range of hardware. Choosing low-end machines for that table is probably deliberate, to show that image generation is viable even on a base M1.
You may be right, but this would be one of the few times that Apple doesn't use best-in-class hardware. For reference, for the benchmark in PyTorch's press release on Apple Silicon, Apple used "production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU 128GB of RAM, and 2TB SSD."

Which M1 MacBook Pro did Apple use for benchmarking? Why is it slower than an M2 MacBook Air?
 
You may be right, but this would be one of the few times that Apple doesn't use best-in-class hardware. For reference, for the benchmark in PyTorch's press release on Apple Silicon, Apple used "production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU 128GB of RAM, and 2TB SSD."

Which M1 MacBook Pro did Apple use for benchmarking? Why is it slower than an M2 MacBook Air?
Apple claims a 40% improvement with the M2 ANE vs the M1.
 
It seems that the neural engine and the GPU don't generate the same image. I wonder which one is faster.
benchmarks.png

  • In the benchmark table, we report the best performing --compute-unit and --attention-implementation values per device. The former does not modify the Core ML model and can be applied during runtime. The latter modifies the Core ML model. Note that the best performing compute unit is model version and hardware-specific.
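To make the "best performing values per device" idea concrete, here is a small hedged sketch of how one might pick the winning combination from one's own timings. The function name is hypothetical, the option strings are what I believe the repo's values look like (treat them as assumptions), and the latencies are made-up placeholders, not Apple's published results.

```python
def best_config(latencies_s):
    """Return the (compute_unit, attention_impl) pair with the lowest latency.

    latencies_s maps (compute_unit, attention_impl) -> seconds per image.
    """
    if not latencies_s:
        raise ValueError("no measurements given")
    return min(latencies_s, key=latencies_s.get)


# Placeholder latencies for ONE hypothetical device: made-up numbers for
# illustration only, not Apple's published results.
example = {
    ("CPU_AND_NE", "SPLIT_EINSUM"): 20.0,
    ("CPU_AND_GPU", "ORIGINAL"): 25.0,
    ("ALL", "SPLIT_EINSUM"): 22.0,
}
print(best_config(example))
```

Since the compute unit can be switched at runtime but the attention implementation is baked into the converted model, a real comparison would need one converted model per attention implementation and then a timing run per compute unit on each.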
 
Apple has chosen peculiar hardware for benchmarking. It usually chooses best-in-class hardware.
Obviously, in this case Apple hardware is not in the same league as best-in-class hardware. I am surprised they decided to do any comparisons at all.
 
Obviously, in this case Apple hardware is not in the same league as best-in-class hardware. I am surprised they decided to do any comparisons at all.

Not at all. In fact, these results are pretty damn close to big desktop GPUs.
 
In fact, these results are pretty damn close to big desktop GPUs.
I don't consider almost twice as fast to be "pretty damn close". What GPU do you have in mind?

62plg1ina4s91.png


Are there any other benchmarks for the neural engine? I find it very interesting that M2's neural engine is as fast as M1's 24-core GPU in inference.
 
I don't consider almost twice as fast to be "pretty damn close". What GPU do you have in mind?

The RTX 3090 (in FP32 mode) needs 8 seconds, the M1 Ultra needs 9 seconds. I would say this is pretty damn close given the obvious performance and power usage disparity between these two GPUs. Sure, in half-precision mode the 3090 is twice as fast. But then the 3090 has FP16 tensor cores with a throughput of 285 TFLOPS, while the M1 Ultra is a 20 TFLOPS GPU. Frankly, I find it really remarkable that an M2 can generate an image in under 25 seconds. That should be close to a (desktop) RTX 3050. Except, of course, that the M2 will have more usable RAM for these workloads...
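Putting the arithmetic behind that comparison in one place, as a quick sketch. It only uses the figures quoted in this post (nothing here is newly measured), and the variable names are mine.

```python
# Figures as quoted in the post above; nothing here is newly measured.
rtx3090_fp32_s = 8.0         # seconds per image, RTX 3090 in FP32 mode
m1_ultra_s = 9.0             # seconds per image, M1 Ultra
rtx3090_fp16_tflops = 285.0  # quoted FP16 tensor-core throughput
m1_ultra_tflops = 20.0       # quoted M1 Ultra GPU throughput

# How much longer the M1 Ultra takes than the 3090 in FP32 mode.
time_ratio = m1_ultra_s / rtx3090_fp32_s

# How much larger the 3090's quoted peak throughput is.
peak_ratio = rtx3090_fp16_tflops / m1_ultra_tflops

print(f"M1 Ultra takes {time_ratio:.3f}x as long as the 3090 in FP32")
print(f"3090 quoted peak is {peak_ratio:.2f}x the M1 Ultra's")
```

Note that the two ratios mix modes: the time ratio is for the 3090's FP32 run, while the throughput ratio uses its FP16 tensor-core peak, so they are not a like-for-like pair.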
 
The RTX 3090 (in FP32 mode) needs 8 seconds, the M1 Ultra needs 9 seconds.
Do you have any links to support these figures?

By the way, it doesn't look good when you have to cherry-pick a benchmark to make the two look alike. Please, let's leave the cherry-picking of benchmarks to corporate marketing and loyal customers.
 
Do you have any links to support these figures?

The numbers are literally in the tables you have quoted in your posts.

By the way, it doesn't look good when you have to cherry-pick a benchmark to make the two look alike. Please, let's leave the cherry-picking of benchmarks to corporate marketing and loyal customers.

How is what I wrote cherry-picking? I am simply commenting on the capabilities of the respective devices. It is absolutely clear that the 3090 is twice as fast, which is also what I wrote. But it is really impressive that Apple can deliver very good performance using this little power and despite such a huge nominal spec discrepancy.

Where this becomes relevant is when you look at the lower-end hardware. Even the base M2 is actually usable for this kind of work. It is a shame that the repo you quote only benchmarked high-end desktop GPUs. I'd love to see the results for the mobile RTX series.

P.S. the results also show that dual-rate FP16 and ML-specialized FP representations make a big difference. Apple GPUs could see a huge boost here if Apple implements these things in their GPU ALUs.
 
The numbers are literally in the tables you have quoted in your posts.
Apple did the benchmarks in FP16, not in FP32.

How is what I wrote cherry-picking?
To support your claim:
In fact, these results are pretty damn close to big desktop GPUs.
Of the two benchmarks you could choose (FP16 and FP32), you chose the one that makes the Ultra perform similarly to the RTX 3090.
The RTX 3090 (in FP32 mode) needs 8 seconds, the M1 Ultra needs 9 seconds. I would say this is pretty damn close

It's amazing what Apple has done with their neural engine, GPU and CoreML, but they don't need us to hype their performance.

By the way, this graph has older Nvidia GPUs.
chartthin.png

 
Apple did the benchmarks in FP16, not in FP32.

Apple GPUs do not have a separate, faster data path for FP16. They perform FP16 and FP32 calculations at the same speed*. Nvidia GPUs can perform FP16 calculations twice as fast as FP32 calculations.

*the AMX and CPU can perform FP16 calculations faster, but I doubt that they contribute even 20% of the performance we see here
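As a back-of-the-envelope sketch of what that means for the numbers in this thread: if a workload were purely compute-bound (a big simplifying assumption that ignores memory bandwidth, tensor cores, and everything else), the FP16 time would just be the FP32 time divided by the FP16 rate multiplier. The function name is hypothetical and the inputs are the 8 s and 9 s figures quoted earlier.

```python
def estimated_fp16_time(fp32_time_s, fp16_rate_multiplier):
    """Naive estimate: assumes the workload is purely compute-bound."""
    return fp32_time_s / fp16_rate_multiplier


# Apple GPU: FP16 runs at the same rate as FP32 (multiplier 1).
m1_ultra_fp16_s = estimated_fp16_time(9.0, 1.0)

# Nvidia GPU: FP16 at twice the FP32 rate (multiplier 2).
rtx3090_fp16_s = estimated_fp16_time(8.0, 2.0)

# How many times faster the 3090 would be in FP16 under this model.
speedup = m1_ultra_fp16_s / rtx3090_fp16_s

print(m1_ultra_fp16_s, rtx3090_fp16_s, speedup)
```

Under this crude model the 3090 ends up a bit over twice as fast in FP16, which is roughly in line with the "twice as fast" figure discussed above.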

Of the two benchmarks you could choose (FP16 and FP32), you chose the one that makes the Ultra perform similarly to the RTX 3090.

Again, I was commenting on device performance when performing a comparable amount of work. Sorry if you thought that I was cherry-picking.

My point is that Apple GPUs are still very young. The current iteration hasn't changed much from the mobile GPU in the iPhone. Other GPUs implement a bunch of tricks (dual FP16 rate, dual-issue instructions, etc.) that Apple has not implemented yet, but that would be relatively straightforward for them to add. What interests me here is the performance potential of an immature product compared to a very mature one that already implements every possible and impossible trick in the book.
 