Apple Silicon deep learning performance

M4pro · Saturday at 7:11 AM

(From 9to5 Mac)

At this year’s NeurIPS event, Apple will have a booth (#1103) where attendees will be able to interact with live demos of the multiple machine learning initiatives from the company, including:

MLX – an open source array framework designed for Apple silicon that enables fast and flexible ML and scientific computing on Apple hardware. The framework is optimized for Apple silicon’s unified memory architecture and leverages both the CPU and GPU. Visitors will be able to experience two MLX demos:

Image generation with a large diffusion model on an iPad Pro with M5 chip

Distributed compute with MLX and Apple silicon: Visitors will be able to explore text and code generation with a 1 trillion-parameter model running in Xcode on a cluster of four Mac Studios equipped with M3 Ultra chips, each operating with 512 GBs of unified memory.

FastVLM – a family of mobile-friendly vision language models, built using MLX. These models use a mix of CNN and Transformer architectures for vision encoding designed specifically for processing high-resolution images. Together, they demonstrate a strong approach that achieves an optimal balance between accuracy and speed. Visitors will get to experience a real-time visual question-and-answer demo on iPhone 17 Pro Max.
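
For a rough sense of scale on the distributed demo, here is a back-of-the-envelope sketch of how much memory the weights of a 1-trillion-parameter model would need against the cluster's 4 x 512 GB. The precisions listed are assumptions (the article does not say which one the demo uses), and the estimate ignores KV cache, activations, and OS overhead.

```python
# Back-of-the-envelope: weights of a 1T-parameter model vs. cluster memory.
NUM_MACHINES = 4              # four Mac Studios (from the article)
MEMORY_PER_MACHINE_GB = 512   # unified memory per machine (from the article)
PARAMS = 1_000_000_000_000    # "1 trillion-parameter model"

# ASSUMPTION: the article does not state the precision used in the demo.
ASSUMED_BITS_PER_WEIGHT = {"bf16": 16, "8-bit": 8, "4-bit": 4}


def weight_size_gb(params: int, bits_per_weight: int) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return params * bits_per_weight / 8 / 1e9


cluster_gb = NUM_MACHINES * MEMORY_PER_MACHINE_GB

for name, bits in ASSUMED_BITS_PER_WEIGHT.items():
    size = weight_size_gb(PARAMS, bits)
    print(f"{name:>5}: {size:6.0f} GB total, "
          f"{size / NUM_MACHINES:5.0f} GB per machine, "
          f"{size / cluster_gb:4.0%} of cluster memory")
```

Under these assumptions, bf16 weights alone (2000 GB) would take nearly all of the 2048 GB, so a quantized model seems more likely, but that is a guess rather than something the article confirms.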

deconstruct60 · Saturday at 12:04 PM

M4pro said:
(From 9to5 Mac)

At this year’s NeurIPS event, Apple will have a booth (#1103) where attendees will be able to interact with live demos of the multiple machine learning initiatives from the company, including:

Somewhat adjacent post on 9to5mac.com.

" ...
Here’s Apple on these results:

On the architectures we tested in this post, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth (120GB/s for the M4, 153GB/s for the M5, which is 28% higher).

..."

Apple shows how much faster the M5 runs local LLMs on MLX - 9to5Mac

A new post on Apple’s Machine Learning Research blog shows how much the M5 improved over the M4 when it comes to running a local LLM.

9to5mac.com

The generation cap tracks the memory bandwidth boost (some overhead, but most results are in the 24+% range versus the 28%). Apple has had flexibility in the LPDDR5 clock range to run the Pro/Max memory at a higher speed than the plain 'Mn' version. The M5 moving up could narrow that gap, depending on whether 'higher than the spec' speeds are available in large enough volumes. Or are they waiting for LPDDR6?

Overall, it looks like they got substantially better at 4-bit compute (as opposed to adding more compute).
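
To put numbers on the bandwidth point, here is a quick sketch of how much of the bandwidth uplift shows up as generation speedup, using only the figures quoted above (120 GB/s, 153 GB/s, and the 19-27% range). Treating memory bandwidth as the only limiter is a simplifying assumption.

```python
# How much of the M4 -> M5 bandwidth uplift shows up as generation speedup?
M4_BANDWIDTH_GBPS = 120.0              # from Apple's post
M5_BANDWIDTH_GBPS = 153.0              # from Apple's post
REPORTED_SPEEDUP_RANGE = (0.19, 0.27)  # "19-27% performance boost"

# Relative bandwidth increase; Apple rounds this to 28%.
bandwidth_uplift = M5_BANDWIDTH_GBPS / M4_BANDWIDTH_GBPS - 1.0


def fraction_realized(speedup: float, uplift: float) -> float:
    """Share of the bandwidth uplift that appears as measured speedup."""
    return speedup / uplift


print(f"bandwidth uplift: {bandwidth_uplift:.1%}")
for s in REPORTED_SPEEDUP_RANGE:
    share = fraction_realized(s, bandwidth_uplift)
    print(f"speedup {s:.0%} -> {share:.0%} of the uplift")
```

The low end of the range captures roughly two thirds of the uplift and the high end nearly all of it, which fits the "some overhead" reading.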

OptimusGrime · Saturday at 12:42 PM

This seems to indicate the new Thunderbolt low latency protocol is RDMA.

Thunderbolt RDMA communications backend by angeloskath · Pull Request #2808 · ml-explore/mlx

Starting the PR. Will update with more info as docs and launcher are written and this goes out of draft.

github.com

deconstruct60 · Sunday at 11:13 AM

OptimusGrime said:
This seems to indicate the new Thunderbolt low latency protocol is RDMA.

Thunderbolt RDMA communications backend by angeloskath · Pull Request #2808 · ml-explore/mlx

Starting the PR. Will update with more info as docs and launcher are written and this goes out of draft.

github.com

There are advanced versions of Ethernet (iWARP and RoCE) that do RDMA. This isn’t necessarily ‘Thunderbolt’ in and of itself.

It is more that Apple is playing catch-up in advanced networking and adding ‘iWARP’-like (or RoCE-like) functionality. They are adding the InfiniBand verbs (ibv) API to the Ethernet point-to-point mode of Thunderbolt.

Linux has had libibverbs for a while now.

“ libibverbs is an implementation of the RDMA verbs for both Infiniband (according to the Infiniband specifications) and iWarp (iWARP verbs specifications). It handles the control path of creating, modifying, querying and destroying resources such as Protection Domains (PD), Completion Queues (CQ), Queue-Pairs (QP), Shared Receive Queues (SRQ), Address Handles (AH), Memory Regions (MR). It also handles sending and receiving data posted to QPs and SRQs, getting completions from CQs using polling and completions events. “

https://www.rdmamojo.com/2012/05/18/libibverbs/

So more code that was written for InfiniBand (Nvidia) and iWARP should be easier to map over. However, this is a network data transmission facility, not a generalized shared memory mechanism.

“…
RDMA technologies are based on networking concepts found in a traditional network but there are differences between them and their counterparts in IP networks. The key difference is that RDMA provides a messaging service which applications can use to directly access the virtual memory on remote computers. … RDMA provides low latency through stack bypass and copy avoidance, reduces CPU utilization, reduces memory bandwidth bottlenecks and provides high bandwidth utilization. …”

Comparison of RDMA Technologies

docs.nvidia.com

There are rumors/rumblings that Apple is working with Broadcom on a SoC for Private Cloud Compute servers. This general library of ibv commands would probably get used on some subset of that network also. Pragmatically, this could be an affordable InfiniBand: a slower top end, but it costs a lot less.

(e.g., the Nvidia DGX Spark comes with some connectors to make a limited-budget cluster also.)

dmccloud · Monday at 10:19 AM

deconstruct60 said:
(e.g., the Nvidia DGX Spark comes with some connectors to make a limited-budget cluster also.)

Alex Ziskind set up some Sparks in a cluster for a video. Despite Nvidia branding it ConnectX, it is actually a standardized specification for optical (fiber) networking (SFP), with different sub-classifications based on supported speeds, and even different connectors (SFP28 vs SFP+). Both Ubiquiti and Sonnet Tech sell compatible routers, network cards and cables that support this standard, as does Micro Center if you're lucky enough to live near one.

OptimusGrime · Monday at 10:53 AM

leman · Tuesday at 6:53 AM

OptimusGrime said:
View attachment 2582110

It is RDMA. I was unable to find any official documentation, but the latest Xcode includes InfiniBand headers, and InfiniBand is an RDMA standard for networking. Cool stuff.

OptimusGrime · Tuesday at 7:15 AM

leman said:
It is RDMA. I was unable to find any official documentation, but the latest Xcode includes InfiniBand headers, and InfiniBand is an RDMA standard for networking. Cool stuff.

Very cool indeed. There are three new commands in 26.2.

Round trip latency seems ok for a first try.

leman · Tuesday at 7:31 AM

OptimusGrime said:
Very cool indeed. There are three new commands in 26.2.
View attachment 2582324
Round trip latency seems ok for a first try.
View attachment 2582325

The latency is quite impressive actually. That's just ~100x slower than RAM and should be better than a modern fast SSD. Not too bad for a pluggable cable connection.

I do find even lower latency figures for professional Infiniband adapters, but that stuff is priced similar to a Mac Mini...

P.S. Did they actually test a connection between two devices or is this a loopback interface? If the latter, I'd like to see actual cross-machine measurements.

OptimusGrime · Tuesday at 8:07 AM

leman said:
The latency is quite impressive actually. That's just ~100x slower than RAM and should be better than a modern fast SSD. Not too bad for a pluggable cable connection.

I do find even lower latency figures for professional Infiniband adapters, but that stuff is priced similar to a Mac Mini...

P.S. Did they actually test a connection between two devices or is this a loopback interface? If the latter, I'd like to see actual cross-machine measurements.

It’s between an M3 Ultra and an M4 Pro.

leman · Tuesday at 10:28 PM

OptimusGrime said:
It’s between an M3 Ultra and an M4 Pro.

Oh, that's not bad at all! So it supports older devices as well as heterogeneous systems? Very nice!

OptimusGrime · Wednesday at 5:38 AM

leman said:
Oh, that's not bad at all! So it supports older devices as well as heterogeneous systems? Very nice!

It’s limited to Thunderbolt 5.

deconstruct60 · Wednesday at 4:18 PM

OptimusGrime said:
View attachment 2582110

Somewhat odd that this was added to IOKit when it is primarily deprecated.
But if it doesn’t fit in DriverKit’s scope, then they don’t really have a choice.
If IOKit dies with Intel macOS and the only path to RDMA is through Thunderbolt, then that is rubbish.

M4pro · 2025-11-27T09:40:16-0800

Some Tahoe 26.2 beta 3 talk (from Macworld. Note - their coverage doesn’t identify what specific “improvements” they think exist in beta 3. Seems like they are re-hashing beta 2):

“Apple released macOS Tahoe 26.2 beta 3 earlier this week, introducing enhancements to AI features for developers. These updates enable the use of the open-source MLX framework with the Neural Accelerator on the M5 chip and simplify AI cluster creation via Mac Studio’s Thunderbolt 5 connectivity.

The first AI feature appeared in macOS Tahoe 26.2 beta 2 several weeks prior and received further improvements in beta 3. This addition allows developers implementing the open-source MLX framework to leverage the Neural Accelerator within the new M5 chip. Apple’s white paper details the framework’s versatility, stating that MLX can be used “for a wide variety of applications ranging from numerical simulations and scientific computing to machine learning.” The document also includes performance benchmarks demonstrating the Neural Accelerator’s impact. Specifically, these tests show the M5 chip achieves speeds 19 to 27 percent higher than the M4 chip under these conditions.

The second AI enhancement focuses on cluster formation using Mac Studio hardware. Developers can now assemble AI clusters through the device’s Thunderbolt 5 connectivity. This method streamlines the process compared to alternatives that require RDMA Ethernet cards and optical modules. As an illustration, engineers built a cluster comprising four M3 Ultra Mac Studios to execute an early-access version of Exo 1.0. Exo 1.0 serves as experimental software designed to permit users to construct and operate AI clusters in home environments. During operation, this cluster consumed less than 500 watts of power, a figure approximately ten times lower than that of a standard GPU-based cluster.”

And here’s some detail and numbers taken directly from Apple’s whitepaper: (deconstruct60 provides a link to 9to5mac’s coverage of this info in a post above)

“Inference Performance on M5 with MLX

The GPU Neural Accelerators introduced with the M5 chip provides dedicated matrix-multiplication operations, which are critical for many machine learning workloads. MLX leverages the Tensor Operations (TensorOps) and Metal Performance Primitives framework introduced with Metal 4 to support the Neural Accelerators’ features. To illustrate the performance of M5 with MLX, we benchmark a set of LLMs with different sizes and architectures, running on MacBook Pro with M5 and 24GB of unified memory, that we compare against a similarly configured MacBook Pro M4.

We evaluate Qwen 1.7B and 8B, in native BF16 precision, and 4-bit quantized Qwen 8B and Qwen 14B models. In addition, we benchmark two Mixture of Experts (MoE): Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (in native MXFP4 precision). Evaluation is performed with mlx_lm.generate, and reported in terms of time to first token generation (in seconds), and generation speed (in terms of token/s). In all these benchmarks, the prompt size is 4096. Generation speed was evaluated when generating 128 additional tokens.

Model performance is reported in terms of time to first token (TTFT) for both M4 and M5 MacBook Pro, along with corresponding speedup.

Time to First Token (TTFT)

Figure 1: TTFT in seconds (smaller is better) for different LLMs run with MLX on M4 and M5 MacBook Pro. Speedup values listed underneath each model name.
In LLM inference, generating the first token is compute-bound, and takes full advantage of the Neural Accelerators. The M5 pushes the time-to-first-token generation under 10 seconds for a dense 14B architecture, and under 3 seconds for a 30B MoE, delivering strong performance for these architectures on a MacBook Pro.

Generating subsequent tokens is bounded by memory bandwidth, rather than by compute ability. On the architectures we tested in this post, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth (120GB/s for the M4, 153GB/s for the M5, which is 28% higher). Regarding memory footprint, the MacBook Pro 24GB can easily hold a 8B in BF16 precision or a 30B MoE 4-bit quantized, keeping the inference workload under 18GB for both of these architectures.

Model                     TTFT Speedup   Generation Speedup   Memory (GB)
Qwen3-1.7B-MLX-bf16       3.57           1.27                 4.40
Qwen3-8B-MLX-bf16         3.62           1.24                 17.46
Qwen3-8B-MLX-4bit         3.97           1.24                 5.61
Qwen3-14B-MLX-4bit        4.06           1.19                 9.16
gpt-oss-20b-MXFP4-Q4      3.33           1.24                 12.08
Qwen3-30B-A3B-MLX-4bit    3.52           1.25                 17.31

Table 1: Inference speedup achieved for different LLMs with MLX on M5 MacBook Pro (compared to M4) for TTFT and subsequent token generation, with corresponding memory demands. TTFT is compute-bound, while generation is memory-bandwidth-bound.

The GPU Neural Accelerators shine with MLX on ML workloads involving large matrix multiplications, yielding up to 4x speedup compared to a M4 baseline for time-to-first-token in language model inference. Similarly, generating a 1024x1024 image with FLUX-dev-4bit (12B parameters) with MLX is more than 3.8x faster on a M5 than it is on a M4...”
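
As a sanity check on the Memory (GB) column, here is a rough sketch comparing it with the raw weight size implied by each model name. The parameter counts are read off the model names (nominal, not exact), the bits per weight ignore quantization scales and metadata, and attributing the remainder to KV cache for the 4096-token prompt plus runtime overhead is my assumption, not something Apple's post states.

```python
# Compare reported memory (Table 1) with a naive raw-weight estimate.
# ASSUMPTIONS: nominal parameter counts taken from the model names; bits per
# weight ignore quantization scales/metadata; 1 GB = 1e9 bytes.
TABLE = [
    # (model, nominal params in billions, bits per weight, reported memory GB)
    ("Qwen3-1.7B-MLX-bf16",     1.7, 16,  4.40),
    ("Qwen3-8B-MLX-bf16",       8.0, 16, 17.46),
    ("Qwen3-8B-MLX-4bit",       8.0,  4,  5.61),
    ("Qwen3-14B-MLX-4bit",     14.0,  4,  9.16),
    ("gpt-oss-20b-MXFP4-Q4",   20.0,  4, 12.08),
    ("Qwen3-30B-A3B-MLX-4bit", 30.0,  4, 17.31),
]


def raw_weights_gb(params_billions: float, bits: int) -> float:
    """Naive weight size: parameters x bits / 8, expressed in GB."""
    return params_billions * bits / 8


estimates = {model: raw_weights_gb(p, b) for model, p, b, _ in TABLE}

for model, _, _, reported in TABLE:
    est = estimates[model]
    print(f"{model:<24} weights ~{est:5.1f} GB, reported {reported:5.2f} GB, "
          f"other ~{reported - est:4.1f} GB")
```

Every naive estimate lands below the reported figure, and all reported figures stay under 18 GB, which is consistent with the post's claim about the 24GB MacBook Pro.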

OptimusGrime · 2025-11-27T10:32:40-0800

M4pro said:
Some Tahoe 26.2 beta 3 talk (from Macworld. Note - their coverage doesn’t identify what specific “improvements” they think exist in beta 3. Seems like they are re-hashing beta 2):

“Apple released macOS Tahoe 26.2 beta 3 earlier this week, introducing enhancements to AI features for developers. These updates enable the use of the open-source MLX framework with the Neural Accelerator on the M5 chip and simplify AI cluster creation via Mac Studio’s Thunderbolt 5 connectivity.

Annoyingly there is currently no option to use both RDMA over TB and the M5’s Neural Accelerators. Hopefully the M5 Pro and Max arrive early next year.

mr_roboto · 2025-11-27T13:38:44-0800

deconstruct60 said:
Somewhat odd that this was added to IOKit when it is primarily deprecated.
But if it doesn’t fit in DriverKit’s scope, then they don’t really have a choice.
If IOKit dies with Intel macOS and the only path to RDMA is through Thunderbolt, then that is rubbish.

Stop. Think about what you're saying. IOKit is deprecated and about to go away, yet Apple engineers (who would surely be aware of that) have chosen to develop RDMA on top of this dead-end tech? This should be a sign to you that some of your assumptions are profoundly wrong.

The truth is that IOKit is not deprecated. Discouraging its use by anyone outside Apple is not the same as wanting to throw it away. An additional truth is that deprecation is not always an indication that Apple plans to remove something tomorrow. Some deprecated APIs stick around for decades. Don't believe me? Check the IOKit API docs! They're evidence for both of these truths.

IOKit | Apple Developer Documentation

Access hardware devices and drivers from your apps and services.

developer.apple.com

Expand categories in the left-hand column and scroll to find individual deprecated IOKit APIs (each of which is marked by an unmistakable red "Deprecated" tag). You will find that very little of IOKit is deprecated, and the portions which are seem to be ancient stuff deprecated long before DriverKit even existed.

A somewhat funny example I just found is IOOpenFirmwarePathMatching(). Open Firmware was what PowerPC Macs used as firmware, so it's been almost 20 years since it was relevant in shipping hardware. Technically, Apple could have removed everything related to Open Firmware in 2009's Snow Leopard, the first version of Mac OS X which dropped PowerPC support. Instead, they kept this API around, deprecated but not gone. At some point they even added Swift bindings, despite the fact that no official Apple Swift compiler (that I'm aware of) can target PowerPC.
