
MacRumors (macrumors bot, original poster)


In collaboration with the Metal engineering team at Apple, PyTorch today announced that its open source machine learning framework will soon support GPU-accelerated model training on Apple silicon Macs powered by M1, M1 Pro, M1 Max, or M1 Ultra chips.


Until now, PyTorch training on the Mac only leveraged the CPU, but an upcoming version will allow developers and researchers to take advantage of the integrated GPU in Apple silicon chips for "significantly faster" model training.

A preview build of PyTorch version 1.12 with GPU-accelerated training is available for Apple silicon Macs running macOS 12.3 or later with a native version of Python. Performance comparisons and additional details are available on PyTorch's website.
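As a quick sanity check (a sketch, assuming the preview build is installed on an Apple silicon Mac), the new backend can be queried from Python:

```python
import torch

# Requires the PyTorch 1.12 preview on macOS 12.3+ with a native (arm64) Python.
print(torch.__version__)
print(torch.backends.mps.is_built())      # was this build compiled with MPS support?
print(torch.backends.mps.is_available())  # is the MPS device usable on this machine?
```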

Article Link: Machine Learning Framework PyTorch Enabling GPU-Accelerated Training on Apple Silicon Macs
 
So unbelievably excited about this. Being able to test GPU-accelerated code locally (and even train some smaller networks) rather than having to rely on a Unix server will speed up ML development significantly.

Can't wait for it to become mature enough to be merged in a stable branch.
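For anyone curious what that local workflow might look like, here's a minimal sketch of a training loop on the new "mps" device (the model and data are made up for illustration, and it assumes the 1.12 preview build):

```python
import torch
import torch.nn as nn

# Fall back to the CPU if the MPS preview backend isn't available.
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data, just to exercise the GPU code path locally.
x = torch.randn(256, 64, device=device)
y = torch.randint(0, 2, (256,), device=device)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```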
 
Nice step toward finding out how capable these GPUs are for this kind of workload. It may also give an idea of whether they work well enough for GPGPU and ML work, which in turn hints at whether Apple is likely to develop its own GPU for the eventual AS Mac Pro.
 
[quoting the earlier post about testing GPU-accelerated code locally]
I assume you are a practitioner. Can you answer the following question:
It *seems* that, even on the M1, all the major ML frameworks are bifurcating into infer on the Apple NPU, and train on the Apple GPU. So what's the role of AMX (introduced for the purposes of ML, and with the first few patent updates very much ML-focussed) in all this?

Answers I could imagine (but I don't know):
- AMX is useful in the exploratory phases of designing a network, where you don't know what you're doing or how it will work, and 32b, even 64b accuracy, is helpful and easy. Once the design is stabilized, you know how much and where you can move it to 16b or even 8b and GPU is better.

- we run the first phases of training on the GPU to get close to convergence, then the final phases at higher accuracy on AMX. (I've suggested doing this, in a different context, for large matrix manipulation like eigenvalue problems or solving elliptic PDEs)

- neural networks have many layers, and some of them are best handled on a GPU, some on AMX, some even on a CPU?

- or perhaps AMX is no longer really part of the ML story?
(Which is not to say it's worthless! The latest patents, to my eyes, suggest it is being turned into a kind of super AVX512: still the matrix functionality, but also a lot of the FP compute abilities of AVX512 without the downsides. So a great facility for math/sci/eng use cases, especially once it can become a direct compiler target, which I expect is coming. The current AMX design has been in such flux, with so many new features added each year, that it's clear the initial instruction set has not grown well into the new functionality. So I expect them at some point, when the pace of change slows down, to fix/redesign the instruction set and perhaps, at that point, expose it to the compiler.)
 
Nice! Just wish tensorflow had better support. Such a pain to use it on a Mac.
It looks easy to install: https://developer.apple.com/metal/tensorflow-plugin/

What’s painful about its use? I haven’t used tensorflow much but have started dabbling in it for research and would love to run it locally on my Mac from time to time. If it’s painful, I’ll stick with my current computer but would love to know what you’ve found challenging with it on an M1 Mac.
 
[quoting the reply above about installing the TensorFlow Metal plugin]
If you’re a newbie, it isn’t that easy. It’s not just pip install tensorflow. Plus you must use it in a conda environment, and it doesn’t appear to use the GPU.
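For what it's worth, once tensorflow-macos and tensorflow-metal are set up per Apple's instructions, a quick way to check whether the GPU is actually visible (just a sketch, not an official diagnostic) is:

```python
import tensorflow as tf

# With the tensorflow-metal plugin working, the Apple GPU should be listed here;
# an empty list suggests the Metal plugin isn't being picked up.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))
```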
 
[quoting the AMX question from earlier in the thread]

I haven't looked specifically into how the MPS backend used for M1 devices works; on CUDA, PyTorch and other frameworks blend different cores (Tensor cores, generic CUDA cores) depending on the operations. They can also fuse some operators to get better performance. From what I've briefly read, since it leverages Metal APIs, I would assume it also takes advantage of the AMX; if not, I'm sure that's in the pipeline.

In general, different parts of a platform (CPU, GPU, various accelerators) are transparently used and optimized for by the framework you are using. In terms of precision, training is done either in fp32 or in mixed precision mode (fp32 + fp16); double floats are unnecessary. During inference, you might go down even to int8, but that requires advanced modeling tricks.

For what it's worth, a lot of ML inference is still done on devices with no dedicated accelerator, so work on AVX-512 optimization for ML (Intel's MKL being a notable example) is an active area of interest for companies.
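To make the precision point concrete, here is a rough sketch of the usual mixed-precision (fp32 + fp16) training pattern with PyTorch's AMP on CUDA; the model and data are placeholders, not anything from the thread:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid fp16 underflow
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # matmuls run in fp16, reductions stay in fp32
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```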
 
It could be really handy to be able to target "cpu", "cuda", or "mps", depending on the task/context!
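Something like the following pattern should cover that (a sketch; the mps check assumes a PyTorch build that actually ships the backend):

```python
import torch

def pick_device() -> torch.device:
    # Prefer CUDA, then Apple's MPS backend, then fall back to the CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(8, 3, device=device)
print(device, x.mean().item())
```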
 
[quoting the AMX question from earlier in the thread]
In my experience it depends on the network architecture and overhead. For example, we've seen LSTM/RNN do better on AMX in terms of speed vs. the GPU, whereas convolutions seem to be significantly less performant there and more ideal for the GPU/Neural Engine.
 
[quoting the AMX question from earlier in the thread]
I did a lot of digging into the AMX myself! Its purpose seems to be accelerating ML for frameworks that don't yet have GPU acceleration: stuff like MATLAB, or previous PyTorch releases that only supported the CPU. Also, the M1 family has killer FP64 performance for matrix multiplications because of the AMX.

PyTorch Forums Thread
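One way to see that FP64 throughput yourself is a simple matmul timing; whether this actually routes through the AMX depends on your NumPy being linked against Apple's Accelerate BLAS, which is an assumption here, not a guarantee:

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n)  # float64 by default
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                 # dispatched to whatever BLAS NumPy was built against
elapsed = time.perf_counter() - start
print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s (FP64 matmul)")
```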
 
Personally I'm hoping to see support for the Neural Engine in the future. When Topaz added Neural Engine support, my M1 MacBook Air suddenly started performing on par with my desktop GTX 1080.
The Neural Engine isn't great for training: the ANE only supports calculations in fp16, int16, and int8, all of which are too low-precision to train with (too much instability). A common approach is to train in fp32, so you can capture the small differences and gradients, and then, once the model is frozen, do inference in fp16 or bf16.

It's great for inference though. But yeah, still very few apps make use of it.
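For illustration, the train-in-fp32, infer-in-fp16 pattern described above looks roughly like this in PyTorch (a sketch; the model is a stand-in, and it assumes an accelerator with fp16 kernels is available):

```python
import torch
import torch.nn as nn

# Stand-in model, assumed to have already been trained in fp32.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# fp16 kernels generally live on the GPU/accelerator, so pick one if present.
if torch.cuda.is_available():
    device = "cuda"
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = "mps"
else:
    device = None  # CPU fp16 matmul support is limited, so skip the demo there

if device is not None:
    model = model.to(device).eval().half()   # freeze and cast weights to fp16
    with torch.no_grad():
        x = torch.randn(1, 128, device=device, dtype=torch.float16)
        print(model(x).dtype)                # torch.float16
```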
 
This is amazing news! Up until now, Nvidia GPUs have been the only game in town for us machine learners, which has automatically meant we can't use Macs for machine learning. Now that's going to change :D

Properly training many current neural networks to convergence will still only be feasible with cloud compute, using multiple machines and multiple high-end Nvidia GPUs per machine, but for development, debugging, and testing it's still incredibly useful to run things locally, and for smaller models the Apple Silicon GPU cores might even be sufficient.
 
[quoting the post above about Nvidia GPUs being the only game in town]
Not sure I’m following… when you say Nvidia is the only way, why don’t you consider Tensorflow-Metal an option? It’s directly from Apple. Do you expect PyTorch with Metal support to be much faster than Tensorflow with Metal? If so, why? Because you hope it will do the backwards graph offloading to GPU more efficiently than Tensorflow-Metal?
 
[quoting the question above about Tensorflow-Metal]

If you already have CUDA code, I suspect most folks don't want to port it. Apple's falling out with Nvidia made this an issue a long, long time ago though, so this isn't exactly new.
 
[quoting the reply above about porting CUDA code]
I'm aware, yes. I was merely wondering why he said Nvidia is the only option until PyTorch with Metal, when TensorFlow with Metal support has existed for a while. Nothing has really changed, unless PyTorch is much faster now.
 
[quoting the earlier post about testing GPU-accelerated code locally]


PyTorch's website only shows the highest end M1 Ultra tested (64-core GPU). Their improvements seem much more substantial than Sebastian's.

Possible explanations may be the Mac Studio's superior ventilation or the M1 Ultra's superior GPU capabilities compared to the lower-end M1s. I've got one of those lower-end M1s and I'm still excited about this.

I've started tinkering with this stuff in my free time and I'm as amateur as can be, but I still look forward to seeing performance improvements. If the software build my boss and I are working on in our free time can go anywhere commercially, it may be worth it to get a Mac Studio when the M2 Ultra comes out.
 
PyTorch's website only shows the highest end M1 Ultra tested (64-core GPU). Their improvements seem much more substantial than Sebastian's.
Are you referring to the M1 GPU vs CPU improvements or something else? The competition is not really the CPU version, but other dedicated GPUs, likely Nvidia.
Possible explanations may be the Mac Studio's superior ventilation or the M1 Ultra's superior GPU capabilities compared to the lower-end M1s.
There are M1 Ultra benchmarks on that site.
If the software build my boss and I are working on in our free time can go anywhere commercially, it may be worth it to get a Mac Studio when the M2 Ultra comes out.
Time will tell. I doubt the M2 Ultra will be a huge jump, but ultimately Apple will need a real Mac Pro replacement, so that will be the one to aim for. I'll keep an eye on things, maybe train a few smaller models on the Apple side, but so far it's not even close to replacing any of my Nvidia systems, not even the smallest one.
 