
gl3lan (original poster):
In case anyone is interested, I ran a fairly simple MNIST benchmark (proposed here: https://github.com/apple/tensorflow_macos/issues/25) on my recently acquired M1 Pro MBP (16-core GPU, 16GB RAM). I installed TensorFlow using the following guide: https://developer.apple.com/metal/tensorflow-plugin/.
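For anyone who wants to reproduce the setup, here is a minimal sketch of that kind of benchmark (a small CNN trained on MNIST with Keras). The exact script lives in the GitHub issue linked above; the model and batch size below are my assumptions, not necessarily the issue's.

# Minimal sketch of an MNIST CNN benchmark of the kind linked above.
# Prerequisite from the Apple guide: pip install tensorflow-macos tensorflow-metal
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keras prints ms/step per epoch, which is the number quoted in this thread.
model.fit(x_train, y_train, batch_size=128, epochs=3)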

For reference, this benchmark seems to run at around 24 ms/step on the base M1 GPU.

On the M1 Pro, the benchmark runs at between 11 and 12 ms/step (twice the TFLOPS of the base M1, and twice as fast).

The same benchmark gives 6 ms/step on an RTX 2080 (13.5 fp32 TFLOPS) and 8 ms/step on a GeForce GTX Titan X (6.7 fp32 TFLOPS). A similar level of performance should also be expected from the M1 Max GPU (which should run roughly twice as fast as the M1 Pro).

Of course, this benchmark runs a fairly simple CNN model, but it already gives an idea. Also keep in mind that RTX-generation cards can run faster at fp16 precision; I am not sure whether the same applies to Apple Silicon.

I would be happy to run any other suggested benchmark (or to help someone run this one on an M1 Max chip), even if I am more of a PyTorch guy. ;-)

[edit] Makes me wonder whether I should have gone for the M1 Max chip... probably not.
 
Are those desktop Nvidia cards you are comparing it to? If so, not bad at all for a compact laptop-class machine.

But yeah, the memory bandwidth and large caches make these machines ideal for data science. There has been some unusually high activity on the PyTorch GitHub recently asking for a native M1 backend. There is a good chance that 2022 is the year when Apple takes the ML community by storm.
 
Getting 64 GB of VRAM for "cheap" is huge.

Previously, you needed a $13k Nvidia A100 card for that.
 
How was it battery- and heat-wise? And I'm curious how the Pro compares to the Max in terms of battery and heat.
Honestly, it did not run long enough for me to evaluate the battery impact. Heat-wise, I have not heard the fan once since I got the machine, and the computer was only slightly warm at some point (but orders of magnitude cooler than my 2015 13" or my 2020 13").

If you are interested in more extensive benchmarks, you can have a look here. Basically, the M1 Max is around 8 times slower than an RTX 3090 (the 3090 benchmark being run in fp16 precision for maximum speed), but it consumes around 8 times less power.

I ran the ResNet50 benchmark on my M1 Pro (16GB RAM) and achieved circa 65 img/sec (half the M1 Max throughput, as expected); the RAM pressure was sometimes orange during the benchmark.

I also ran the same benchmark on an RTX 2080 Ti (256 img/sec in fp32 precision, 620 img/sec in fp16 precision) and on a 2015-era GeForce GTX Titan X (128 img/sec in fp32 precision, 170 img/sec in fp16 precision).
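The ResNet50 numbers above come from the linked benchmark; as a rough, hedged sketch of how such an img/sec figure can be measured with synthetic data (the batch size, image size and step count below are my assumptions, not the linked benchmark's settings):

# Rough img/sec measurement for ResNet50 training on synthetic data.
import time
import tensorflow as tf

batch, steps = 64, 30
images = tf.random.uniform((batch, 224, 224, 3))
labels = tf.random.uniform((batch,), maxval=1000, dtype=tf.int32)

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

model.train_on_batch(images, labels)  # warm-up / graph build
start = time.time()
for _ in range(steps):
    model.train_on_batch(images, labels)
elapsed = time.time() - start
print(f"{batch * steps / elapsed:.1f} img/sec")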

Overall I think the M1 Max is a very promising GPU, especially considering it can be configured with 64 GB of RAM. Time will tell whether the ML community adopts these machines and whether it becomes relevant to port PyTorch as well.

Depending on the progress of subsequent Apple Silicon chip generations (and on the GPUs offered in a future Mac Pro), deep learning on the Mac might become attractive.
 
These are awful results, to be honest. The M1 Max should be much faster than that. With Apple it's really half a step forward and then an awkward dance in all directions... they release an open-source TensorFlow version, then suddenly drop it and replace it with a closed-source plugin that's hidden away on their website, with no documentation, no changelog, no anything. Just make it open source. Give the community the opportunity to fix bugs.

Also, it seems that the TensorFlow plugin is not using the AMX accelerators, which are the fastest matrix hardware on the M1. Why?

Regarding your notes on FP16 and FP64: the M1 GPU does not support FP64, and FP16 gets promoted to FP32 in the ALUs. There is no performance difference between FP16 and FP32 (except that FP16 uses less register file space and can improve hardware utilization on complex shaders, but I doubt that applies to ML).
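If someone wants to verify this on their own hardware, Keras's mixed-precision API makes the fp16-vs-fp32 comparison easy to script. This is only a sketch for measuring the difference (layer sizes and step count are arbitrary choices of mine); it makes no claim about what the Metal plugin does internally:

# Compare a training step under float32 vs mixed_float16.
# Whether fp16 helps depends entirely on the backend (it does on RTX tensor
# cores; per the comment above, it should not on the M1 GPU).
import time
import tensorflow as tf

def time_policy(policy, steps=20):
    tf.keras.mixed_precision.set_global_policy(policy)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4096, activation="relu", input_shape=(4096,)),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    x = tf.random.uniform((256, 4096))
    y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
    model.train_on_batch(x, y)  # warm-up / graph build
    start = time.time()
    for _ in range(steps):
        model.train_on_batch(x, y)
    return (time.time() - start) / steps

for policy in ("float32", "mixed_float16"):
    print(policy, f"{time_policy(policy) * 1000:.1f} ms/step")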
 
I would be curious whether you have evaluated Apple's Core ML libraries mentioned here:

I'm a big fan of PyTorch, and this is the one biggest frustration I have with the DL ecosystem right now... it's becoming a very closed-source/proprietary ecosystem. Outside of that, these systems have so much potential that is being wasted because of the lack of support.
 
@CarbonCycles It seems Core ML/Neural Engine is good for inference, but not for training.

Does anyone know which hardware and deep learning library Apple uses to train its models?
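On the inference side, exporting a trained Keras model to Core ML is straightforward with coremltools; a minimal sketch (the saved-model path is a hypothetical placeholder), which again covers inference only, not training:

# Minimal sketch: export a trained Keras model to Core ML for on-device inference.
import coremltools as ct
import tensorflow as tf

# Hypothetical path to any trained tf.keras model (e.g. the MNIST CNN earlier in the thread).
model = tf.keras.models.load_model("mnist_cnn.h5")

mlmodel = ct.convert(
    model,
    compute_units=ct.ComputeUnit.ALL,  # let Core ML use CPU, GPU and the Neural Engine
)
mlmodel.save("mnist_cnn.mlmodel")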
 
Probably Nvidia hardware and the TensorFlow or PyTorch libraries... is there any alternative?
 
Yeah, Core ML works great for peripheral IoT devices and some federated learning.

I would think Apple uses a combination of NVIDIA and AMD GPUs to train.
 
I would say PyTorch on Linux with Nvidia GPUs. PyTorch seems to be more popular among researchers developing new algorithms, so it would make sense for Apple to use PyTorch more than TensorFlow.

Are these new laptops suitable for reinforcement learning? I've read that reinforcement learning algorithms depend more on the CPU than on the GPU. Is that true? Any benchmarks?
 
The closed-source plugin situation is probably a symptom of competing teams butting heads. I'm sure there are many open-source advocates working at Apple, and many who want to keep everything they can proprietary.
 
Be aware that the MNIST benchmark you ran was written using TensorFlow V1. This is an old, obsolete version of TF that may or may not be getting any updates. TensorFlow V2 has been out since 2019 or so, and it is where most development effort is centered.

If they have a more recent benchmark, I suggest you use that.
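A quick, hedged way to check which era a benchmark script belongs to from a Python session (nothing here is specific to the Apple plugin):

import tensorflow as tf

print(tf.__version__)          # 2.x on the tensorflow-macos / tensorflow-metal stack
print(tf.executing_eagerly())  # True under V2 defaults

# Tell-tale signs of a V1-era script: calls such as
#   tf.compat.v1.Session(), tf.compat.v1.placeholder(), tf.compat.v1.disable_eager_execution()
# A V2-era benchmark instead uses tf.keras / tf.function and eager execution.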
 
For what it's worth, the second benchmark above (ResNet50) was run on TensorFlow V2 code.
 
It seems Apple's TensorFlow fork doesn't support all TensorFlow raw_ops. So we need to wait a little longer to see the true potential of these new GPUs for deep learning.
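One way to see which ops the plugin actually handles is TensorFlow's device-placement logging: ops that cannot be placed on the Metal device fall back to the CPU (or error out). A small sketch:

# Log where each op runs; unsupported ops will show up placed on the CPU.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)
print(tf.config.list_physical_devices("GPU"))  # the Metal device should appear here

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
c = tf.linalg.matmul(a, b)  # placement of MatMul is printed as it executes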
 
I must say the above benchmark is a bit disappointing; it's a pity Apple doesn't let TensorFlow use the Neural Engine.

Based on the above results, it could be a good machine for reinforcement learning, a domain where the neural networks are small and there is a lot of CPU<->GPU communication.
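To gauge that CPU-vs-GPU point for RL-sized networks on a given machine, here is a rough timing sketch (the tiny MLP below is an arbitrary stand-in for a policy network, not any particular benchmark):

# Time a tiny MLP forward pass (typical RL policy-network size) on CPU vs GPU.
# With networks this small, kernel-launch and transfer overhead usually dominates,
# so the CPU can come out ahead.
import time
import tensorflow as tf

for device in ("/CPU:0", "/GPU:0"):
    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="tanh", input_shape=(8,)),
            tf.keras.layers.Dense(64, activation="tanh"),
            tf.keras.layers.Dense(4),
        ])
        x = tf.random.uniform((1, 8))  # single observation, as in an RL rollout
        model(x)  # warm-up, builds the weights on this device
        start = time.time()
        for _ in range(1000):
            model(x)
        elapsed = time.time() - start
    print(device, f"{elapsed:.3f}s for 1000 single-observation forward passes")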
 
The Neural Engine is limited-purpose. The “real” deep learning accelerator is the AMX unit, but it’s unclear whether the Pro/Max include more AMX resources.
 
TensorFlow and PyTorch are open-source projects, so Apple could provide them with a Metal backend, as it is doing with Blender, the open-source 3D computer graphics software.

By the way, it seems that the Neural Engine is a little tricky to use: https://github.com/hollance/neural-engine
It seems that Apple is working with Google and Facebook to provide a Metal backend for TensorFlow (already the case) and PyTorch (a work in progress, it seems). However, if it only uses GPU acceleration (as is the case with TensorFlow right now), we might not get groundbreaking performance. It may be nice for one epoch (i.e., prototyping), but not for large model training.
Letting developers use the AME would be huge, but I don't see Apple doing this in the near future. I hope I'm wrong...

This repo is very interesting, thanks for sharing :)

I'm waiting a bit to see how the situation evolves, especially with PyTorch.
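For reference, the PyTorch work in progress mentioned above eventually shipped as the "mps" backend; from user code, checking for and selecting it looks roughly like this (a sketch assuming a PyTorch build that includes the backend):

# Sketch of selecting a native Metal device in PyTorch via the "mps" backend.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
y = x @ x  # the matrix multiply runs on the Metal GPU when "mps" is selected
print(device, y.shape)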
 
Aren't the AMX units inside the AME? From Apple's website, it seems the number of Neural Engine cores is the same across the Pro/Max versions.
 
No, AMX and AME are two different things. The confusing thing is that Apple offers three ways of doing ML on their hardware: the AME, AMX, and the GPU. The AME seems to be limited to tasks like audio and image processing that Apple uses for its own apps, AMX is a general-purpose matrix multiplication unit (good for model training), and the GPU is the most flexible but also the least efficient of the three.
 
Something doesn't make sense... it seems like the AMX is a souped-up math coprocessor, but how is it more efficient than using the GPU? Something else must be in play (i.e., did they build LAPACK/BLAS directly into the hardware instruction set?!?).

ETA:
After reading the Medium article on the AMX coprocessor, it makes more sense, as they have highly tuned those math libraries for AMX. Kind of disturbing, though... Apple can really mess with this ecosystem since it's closed off (i.e., Apple's little secret).
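A practical way to poke at that BLAS path from Python is through NumPy: a NumPy build linked against Apple's Accelerate framework dispatches large matrix multiplies to routines that are widely reported to use the AMX units (that attribution is inference from discussions like this one, not official Apple documentation). A rough sketch:

# Time a large matmul through NumPy's BLAS. Whether your NumPy is linked against
# Accelerate (vecLib) is shown by np.show_config().
import time
import numpy as np

np.show_config()  # look for "accelerate" / "vecLib" in the BLAS section

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b  # warm-up
start = time.time()
a @ b
elapsed = time.time() - start
print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOPS (fp32 matmul)")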
 