(From 9to5 Mac)

At this year’s NeurIPS event, Apple will have a booth (#1103) where attendees will be able to interact with live demos of the multiple machine learning initiatives from the company, including:

  • MLX – an open source array framework designed for Apple silicon that enables fast and flexible ML and scientific computing on Apple hardware. The framework is optimized for Apple silicon’s unified memory architecture and leverages both the CPU and GPU. Visitors will be able to experience two MLX demos:
    • Image generation with a large diffusion model on an iPad Pro with M5 chip
    • Distributed compute with MLX and Apple silicon: Visitors will be able to explore text and code generation with a 1 trillion-parameter model running in Xcode on a cluster of four Mac Studios equipped with M3 Ultra chips, each operating with 512 GBs of unified memory.
  • FastVLM – a family of mobile-friendly vision language models, built using MLX. These models use a mix of CNN and Transformer architectures for vision encoding designed specifically for processing high-resolution images. Together, they demonstrate a strong approach that achieves an optimal balance between accuracy and speed. Visitors will get to experience a real-time visual question-and-answer demo on iPhone 17 Pro Max.
 
Somewhat adjacent post on 9to5mac.com.

" ...
Here’s Apple on these results:

On the architectures we tested in this post, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth (120GB/s for the M4, 153GB/s for the M5, which is 28% higher).
..."

The generation speedup flows from the memory bandwidth boost (some overhead, but most results are in the 24+% range versus the 28%). Apple has had flexibility in the LPDDR5 clock range to place the Pro/Max memory at a higher speed than the plain 'Mn' version. The M5 moving up could narrow that gap. That depends on whether 'higher than the spec' speeds are available in large enough volumes. Or are they waiting for LPDDR6?
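
As a rough check on that reading, here is a minimal sketch that compares the bandwidth ratio with the per-model generation speedups from Apple's table (the table is quoted further down in this thread; the variable names are only for illustration):

```python
# Memory bandwidth figures quoted by Apple (GB/s).
M4_BANDWIDTH = 120.0
M5_BANDWIDTH = 153.0

# Generation speedups (M5 vs M4) from Apple's table, as quoted in this thread.
generation_speedup = {
    "Qwen3-1.7B-MLX-bf16": 1.27,
    "Qwen3-8B-MLX-bf16": 1.24,
    "Qwen3-8B-MLX-4bit": 1.24,
    "Qwen3-14B-MLX-4bit": 1.19,
    "gpt-oss-20b-MXFP4-Q4": 1.24,
    "Qwen3-30B-A3B-MLX-4bit": 1.25,
}

# Fractional bandwidth gain, about 0.275 (Apple rounds this to 28%).
bandwidth_gain = M5_BANDWIDTH / M4_BANDWIDTH - 1.0

for model, speedup in generation_speedup.items():
    gain = speedup - 1.0
    realized = gain / bandwidth_gain  # share of the bandwidth gain that shows up
    print(f"{model:24s} +{gain * 100:3.0f}%  ({realized:.0%} of the bandwidth gain)")
```

Five of the six models land at +24% or better, which is what you would expect if token generation is limited almost entirely by memory bandwidth.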

Overall, it looks like they got substantially better at 4-bit compute (as opposed to just adding more compute).
 
This seems to indicate the new Thunderbolt low latency protocol is RDMA.

There are advanced versions of Ethernet (iWARP and RoCE) that do RDMA. This isn’t necessarily ‘Thunderbolt’ in and of itself.


It is more that Apple is playing catch-up in advanced networking and adding ‘iWARP’-like (or RoCE-like) functionality. They are adding the InfiniBand verbs (ibv) API to the Ethernet point-to-point mode of Thunderbolt.

Linux has had libibverbs for a while now.

“libibverbs is an implementation of the RDMA verbs for both Infiniband (according to the Infiniband specifications) and iWarp (iWARP verbs specifications). It handles the control path of creating, modifying, querying and destroying resources such as Protection Domains (PD), Completion Queues (CQ), Queue-Pairs (QP), Shared Receive Queues (SRQ), Address Handles (AH), Memory Regions (MR). It also handles sending and receiving data posted to QPs and SRQs, getting completions from CQs using polling and completions events.”



So more code that was written for InfiniBand (Nvidia) and iWARP should be easier to map over. However, this is a network data transmission facility, not a generalized shared memory mechanism.

“…
RDMA technologies are based on networking concepts found in a traditional network but there are differences between them and their counterparts in IP networks. The key difference is that RDMA provides a messaging service which applications can use to directly access the virtual memory on remote computers. … RDMA provides low latency through stack bypass and copy avoidance, reduces CPU utilization, reduces memory bandwidth bottlenecks and provides high bandwidth utilization. …”
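
To make the "messaging service, not shared memory" distinction concrete, here is a toy in-process model of the verbs flow: register a memory region, post a work request, let the adapter process it, poll for the completion. This is only a sketch of the concept. Every class and method name below is hypothetical, and none of it is the libibverbs API or anything Apple ships.

```python
from collections import deque


class MemoryRegion:
    """A registered buffer. Only registered memory can be the target of a transfer."""

    def __init__(self, size, key):
        self.buf = bytearray(size)
        self.key = key


class ToyAdapter:
    """Stand-in for the RDMA hardware (hypothetical, for illustration only)."""

    def __init__(self):
        self.regions = {}           # key -> MemoryRegion on the "remote" side
        self.send_queue = deque()   # posted work requests
        self.completions = deque()  # completion queue entries

    def register(self, size, key):
        region = MemoryRegion(size, key)
        self.regions[key] = region
        return region

    def post_write(self, wr_id, payload, remote_key, offset=0):
        """Queue a work request; nothing moves until the adapter processes it."""
        self.send_queue.append((wr_id, bytes(payload), remote_key, offset))

    def process(self):
        """Drain the send queue, copying each payload into its target region."""
        while self.send_queue:
            wr_id, payload, key, offset = self.send_queue.popleft()
            target = self.regions.get(key)
            ok = target is not None and offset + len(payload) <= len(target.buf)
            if ok:
                target.buf[offset:offset + len(payload)] = payload
            self.completions.append((wr_id, "ok" if ok else "error"))

    def poll(self):
        """Return the next completion, or None if nothing has finished."""
        return self.completions.popleft() if self.completions else None


adapter = ToyAdapter()
remote = adapter.register(size=8, key=42)
local = bytearray(b"weights!")

adapter.post_write(wr_id=1, payload=local, remote_key=42)
before = bytes(remote.buf)   # still all zeros: posting alone moves nothing
adapter.process()
first = adapter.poll()

local[0:1] = b"W"            # a later local change is not mirrored remotely
adapter.post_write(wr_id=2, payload=b"x", remote_key=99)  # unregistered key
adapter.process()
second = adapter.poll()
```

The remote buffer only changes when a transfer is explicitly posted and processed, and a write against an unregistered key completes with an error. That is the sense in which this is data transmission rather than coherent shared memory.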


There are rumors/rumblings that Apple is working with Broadcom on a SoC for Private Cloud Compute servers. This general library of ibv commands would probably get used on some subset of that network also. Pragmatically this could be an affordable InfiniBand: slower at the top end, but it costs a lot less.

(e.g., the Nvidia DGX Spark also comes with some connectors to make a limited budget cluster.)
 

Alex Ziskind set up some Sparks in a cluster for a video. Despite Nvidia branding it ConnectX, it is actually a standardized specification for optical (fiber) networking (SFP), with different sub-classifications based on supported speeds, even different connectors (SFP28 vs SFP+). Both Ubiquiti and Sonnet Tech sell compatible routers, network cards and cables that support this standard, as does Micro Center if you're lucky enough to live near one.
 
It is RDMA. I was unable to find any official documentation, but the latest Xcode includes InfiniBand headers, and InfiniBand is an RDMA standard for networking. Cool stuff.
Very cool indeed. There are three new commands in 26.2. [screenshot]

Round trip latency seems ok for a first try. [screenshot]
 
The latency is quite impressive actually. That's just ~100x slower than RAM and should be better than a modern fast SSD. Not too bad for a pluggable cable connection.

I do find even lower latency figures for professional Infiniband adapters, but that stuff is priced similar to a Mac Mini...

P.S. Did they actually test a connection between two devices or is this a loopback interface? If the latter, I'd like to see actual cross-machine measurements.
 
It’s between an M3 Ultra and an M4 Pro.
 
Some Tahoe 26.2 beta 3 talk (from Macworld. Note - their coverage doesn’t identify what specific “improvements” they think exist in beta 3. Seems like they are re-hashing beta 2):

“Apple released macOS Tahoe 26.2 beta 3 earlier this week, introducing enhancements to AI features for developers. These updates enable the use of the open-source MLX framework with the Neural Accelerator on the M5 chip and simplify AI cluster creation via Mac Studio’s Thunderbolt 5 connectivity.

The first AI feature appeared in macOS Tahoe 26.2 beta 2 several weeks prior and received further improvements in beta 3. This addition allows developers implementing the open-source MLX framework to leverage the Neural Accelerator within the new M5 chip. Apple’s white paper details the framework’s versatility, stating that MLX can be used “for a wide variety of applications ranging from numerical simulations and scientific computing to machine learning.” The document also includes performance benchmarks demonstrating the Neural Accelerator’s impact. Specifically, these tests show the M5 chip achieves speeds 19 to 27 percent higher than the M4 chip under these conditions.

The second AI enhancement focuses on cluster formation using Mac Studio hardware. Developers can now assemble AI clusters through the device’s Thunderbolt 5 connectivity. This method streamlines the process compared to alternatives that require RDMA Ethernet cards and optical modules. As an illustration, engineers built a cluster comprising four M3 Ultra Mac Studios to execute an early-access version of Exo 1.0. Exo 1.0 serves as experimental software designed to permit users to construct and operate AI clusters in home environments. During operation, this cluster consumed less than 500 watts of power, a figure approximately ten times lower than that of a standard GPU-based cluster.”

And here’s some detail and numbers taken directly from Apple’s whitepaper: (deconstruct60 provides a link to 9to5mac’s coverage of this info in a post above)

“Inference Performance on M5 with MLX

The GPU Neural Accelerators introduced with the M5 chip provides dedicated matrix-multiplication operations, which are critical for many machine learning workloads. MLX leverages the Tensor Operations (TensorOps) and Metal Performance Primitives framework introduced with Metal 4 to support the Neural Accelerators’ features. To illustrate the performance of M5 with MLX, we benchmark a set of LLMs with different sizes and architectures, running on MacBook Pro with M5 and 24GB of unified memory, that we compare against a similarly configured MacBook Pro M4.

We evaluate Qwen 1.7B and 8B, in native BF16 precision, and 4-bit quantized Qwen 8B and Qwen 14B models. In addition, we benchmark two Mixture of Experts (MoE): Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (in native MXFP4 precision). Evaluation is performed with mlx_lm.generate, and reported in terms of time to first token generation (in seconds), and generation speed (in terms of token/s). In all these benchmarks, the prompt size is 4096. Generation speed was evaluated when generating 128 additional tokens.

Model performance is reported in terms of time to first token (TTFT) for both M4 and M5 MacBook Pro, along with corresponding speedup.

Time to First Token (TTFT)

Figure 1: TTFT in seconds (smaller is better) for different LLMs run with MLX on M4 and M5 MacBook Pro. Speedup values listed underneath each model name.
In LLM inference, generating the first token is compute-bound, and takes full advantage of the Neural Accelerators. The M5 pushes the time-to-first-token generation under 10 seconds for a dense 14B architecture, and under 3 seconds for a 30B MoE, delivering strong performance for these architectures on a MacBook Pro.

Generating subsequent tokens is bounded by memory bandwidth, rather than by compute ability. On the architectures we tested in this post, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth (120GB/s for the M4, 153GB/s for the M5, which is 28% higher). Regarding memory footprint, the MacBook Pro 24GB can easily hold a 8B in BF16 precision or a 30B MoE 4-bit quantized, keeping the inference workload under 18GB for both of these architectures.

Model                     TTFT Speedup   Generation Speedup   Memory (GB)
Qwen3-1.7B-MLX-bf16       3.57           1.27                 4.40
Qwen3-8B-MLX-bf16         3.62           1.24                 17.46
Qwen3-8B-MLX-4bit         3.97           1.24                 5.61
Qwen3-14B-MLX-4bit        4.06           1.19                 9.16
gpt-oss-20b-MXFP4-Q4      3.33           1.24                 12.08
Qwen3-30B-A3B-MLX-4bit    3.52           1.25                 17.31
Table 1: Inference speedup achieved for different LLMs with MLX on M5 MacBook Pro (compared to M4) for TTFT and subsequent token generation, with corresponding memory demands. TTFT is compute-bound, while generation is memory-bandwidth-bound.

The GPU Neural Accelerators shine with MLX on ML workloads involving large matrix multiplications, yielding up to 4x speedup compared to a M4 baseline for time-to-first-token in language model inference. Similarly, generating a 1024x1024 image with FLUX-dev-4bit (12B parameters) with MLX is more than 3.8x faster on a M5 than it is on a M4...”
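
For a feel of where the memory column comes from, here is a back-of-the-envelope sketch (mine, not from Apple's post) that estimates weight storage alone as parameters x bits / 8. The assumptions are loud ones: parameter counts are the nominal figures in the model names, 4-bit models are costed at 4.5 bits per weight to allow for per-group scales and biases, and GB is treated as decimal. gpt-oss-20b is left out because its mixed MXFP4 layout does not fit a single bits-per-weight figure.

```python
# (nominal parameters in billions, assumed bits per weight, memory in GB from Apple's table)
MODELS = {
    "Qwen3-1.7B-MLX-bf16": (1.7, 16.0, 4.40),
    "Qwen3-8B-MLX-bf16": (8.0, 16.0, 17.46),
    "Qwen3-8B-MLX-4bit": (8.0, 4.5, 5.61),
    "Qwen3-14B-MLX-4bit": (14.0, 4.5, 9.16),
    "Qwen3-30B-A3B-MLX-4bit": (30.0, 4.5, 17.31),
}


def weight_gb(params_billion, bits_per_weight):
    """Weights only, in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8.0


for name, (params, bits, reported) in MODELS.items():
    estimate = weight_gb(params, bits)
    # The remainder covers whatever is not plain weight storage (KV cache, activations, etc.).
    print(f"{name:24s} weights ~{estimate:6.2f} GB, reported {reported:6.2f} GB")
```

Under these assumptions every estimate comes in below the reported figure, with the remainder presumably going to the KV cache for the 4096-token prompt and other runtime overhead.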
 
Annoyingly there is currently no option to use both RDMA over TB and the M5’s Neural Accelerators. Hopefully the M5 Pro and Max arrive early next year.
 
Somewhat odd that this was added to IOKit when it is primarily deprecated.
But if it doesn’t fit in DriverKit's scope then they don’t really have a choice.
If IOKit dies with Intel macOS and the only path to RDMA is through Thunderbolt, then that is rubbish.
Stop. Think about what you're saying. IOKit is deprecated and about to go away, yet Apple engineers (who would surely be aware of that) have chosen to develop RDMA on top of this dead-end tech? This should be a sign to you that some of your assumptions are profoundly wrong.

The truth is that IOKit is not deprecated. Discouraging its use by anyone outside Apple is not the same as wanting to throw it away. An additional truth is that deprecation is not always an indication that Apple plans to remove something tomorrow. Some deprecated APIs stick around for decades. Don't believe me? Check the IOKit API docs! They're evidence for both of these truths.


Expand categories in the left hand column and scroll to find individual deprecated IOKit APIs (each of which is marked by an unmistakable red "Deprecated" tag). You will find that very little of IOKit is deprecated, and the portions which are seem to be ancient stuff deprecated long before DriverKit even existed.

A somewhat funny example I just found is IOOpenFirmwarePathMatching(). Open Firmware was what PowerPC Macs used as firmware, so it's been almost 20 years since it was relevant in shipping hardware. Technically, Apple could have removed everything related to Open Firmware in 2009's Snow Leopard, the first version of Mac OS X which dropped PowerPC support. Instead, they kept this API around, deprecated but not gone. At some point they even added Swift bindings, despite the fact that no official Apple Swift compiler (that I'm aware of) can target PowerPC.
 
It's great to see how Apple is improving MLX.
 
There is an interesting comparison between MLX and Core AI.

More info: https://github.com/john-rocky/apple-silicon-llm-bench

It's important to understand what these sorts of benchmarks do and don't tell us. I raise this point because I keep seeing confusion.

1. CoreAI is about packaging up a model you acquire from elsewhere and getting it to run on Apple Silicon. Just like CoreML.
It is NOT a "convert model to ANE" tool. Many people were hoping it would be such a tool (with CoreML staying in its existing role), but things are what they are.
So if your goal is "optimize a model for ANE", things remain as they were with CoreML: lots of spelunking through where each layer runs and figuring out why it chose not to run on ANE. CoreAI *may* make this easier in that various limits (especially those that look artificial, eg various size limits) may have been raised.

2. This benchmark appears to show the GPU as (generally) much faster than ANE. This can be true *for non-Apple models*. Apple has published in many places anything from hints to outright definite statements about what runs optimally fast on ANE (for example ReLU is definitely preferred, as is an embedding vector ordering that tends to generate *long* clusters of zero activations). Many of these choices are very different from what's optimal on nVidia (eg nVidia very much prefers zeros spread evenly across weights and activations rather than Apple's long runs of zeros). These are not "better or worse", they are just different HW choices.

Bottom line is that
- Apple, building its models from scratch, gets much better performance out of ANE than do 3rd party models. This is not even, I think, because Apple uses secret functionality; it's just that taking a model optimized for nV HW choices and slapping it on ANE makes lousy use of ANE. This could be rectified somewhat if people bothered to understand how ANE really works, but that knowledge is still diffuse and most porters don't even seem aware of the resources available.
To do better, more sophisticated model modification is required (eg re-ordering embedding tables, with consequent reordering of all weights; or replacing activation functions with ReLU, which probably requires adding some LoRA layers to fix the resultant slight inaccuracies).

- As far as numbers go the single number summary line is that Apple's AFM3 (the new one in iOS27) runs at about 60..70 tps, and runs on ANE. There are a few more details here:
It runs at about half the speed of the PCC (cloud) model.
It's unclear if the model tested was the "baseline" AFM3 or the advanced (20B) AFM3.
It IS running on ANE even though there was initially some confusion about this because the various performance monitoring apps needed to be updated for OS27.

- I've mentioned before the analogy to codecs in the early days of QuickTime. QuickTime began life in 1991 supporting all codecs, and it wasn't clear to Apple (and anyone else!) how this would play out, if there would be a single dominant codec or a perpetual pool of 10 codecs used for different situations.
But by the mid nineties it was already clear that the extreme variety of choices was sub-optimal, that most people just wanted a "good enough" that was always there and always worked; AND (unlike a few years earlier), by 1997 Apple had access to such a codec in the form of Sorenson.
This is where we are now with AI models.
Sorenson was far from perfect, and far from state of the art. But it WAS good enough in terms of balancing everything relevant and feasible given the hardware of the time (including, as always, the installed base).

- Sorenson meant that the frontier was over. Crazy amateurs (like myself in 1990, getting MPEG-1 working on my Mac SE/30, or Sam in Australia getting PNG working) had a good run, but our time as amateurs had passed.
We'll continue to see people playing with hero ports of various models to Apple HW, but I suspect it's going to be less and less relevant to most people (unless Apple does really stupid things with guard rails *cough* Anthropic *cough*).
Most people will find Siri AI (and more precisely "the system as Apple ships it") just works for what they need, even if what they need is fairly specialized [OCR? Translation? Coding? File format conversion? all tbd, and probably all not as good as they should be, but also subject to constant revision with each OS update].
Probably not if you are at the leading edge of OpenClaw deployment (while OS27 has a lot of agentic infrastructure, and some limited agentic functionality to get people used to the idea, it's very much not yet an agentic OS, let alone an "open-ended" agentic OS).

- But Sorenson also did NOT mean improvement was over. Even at the time Sorenson was clearly quality inferior to MPEG-1 (let alone h.263 done right), but it was fast enough on older HW, unlike those two. We'll continue to see local AFM improve, and we'll continue to see the routing between local AFM vs the two cloud models improve.
But the world that will matter for most people will be what Apple ships, not experiments on github. I say this not as a judgement, just as a statement of how these things play out. For those who have enjoyed being enthusiastic amateurs, I'd suggest the way to harness your talents going forward is no longer constant fiddling with porting new models, and more of either
+ unusual ways to exploit/optimize what Apple provides in the OS (ie clever new use cases, clever new wrapper apps)
+ exploring within the standard existing open models how they work, what they tell us about reasoning etc (eg clever model surgery, exploring looping models that reason in latent rather than token space, large concept models [perhaps with innovative tokenization or multi-tokenization schemes as an intermediate step], etc.)
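
On the "re-ordering embedding tables, with consequent reordering of all weights" point above, the reason this kind of surgery is possible at all is that hidden units can be permuted without changing a network's output, as long as the next layer's weights are permuted to match. A toy sketch in plain Python follows. The tiny two-layer ReLU network and the "zeros first" ordering based on a single input are my own illustration; a real port would presumably use activation statistics over many inputs, and none of this is a documented Apple procedure.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mlp(w1, w2, x):
    """Two-layer network: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def permute_hidden(w1, w2, order):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    new_w1 = [w1[j] for j in order]
    new_w2 = [[row[j] for j in order] for row in w2]
    return new_w1, new_w2


def longest_zero_run(v):
    best = run = 0
    for x in v:
        run = run + 1 if x == 0.0 else 0
        best = max(best, run)
    return best


rng = random.Random(0)
n_in, n_hidden, n_out = 4, 16, 3
w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

hidden = relu(matvec(w1, x))
# Put the units that are zero for this input first (stable sort keeps the rest in order).
order = sorted(range(n_hidden), key=lambda j: hidden[j] != 0.0)
p_w1, p_w2 = permute_hidden(w1, w2, order)
p_hidden = relu(matvec(p_w1, x))

y = mlp(w1, w2, x)
p_y = mlp(p_w1, p_w2, x)
print("longest zero run before:", longest_zero_run(hidden))
print("longest zero run after: ", longest_zero_run(p_hidden))
```

The outputs match to within floating-point rounding, while the zero activations end up in one contiguous run instead of being scattered.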
 
I wrote up some preliminary thoughts on "why" LLMs "work" here:

May be of interest to some people reading this thread.
You are right about the internal representation: there absolutely is one, it's different literally per token, and you can measure this with the right apparatus and science. Advanced models also, language wise, do "think" (ugh) in language that is abstract. Anthropic's system card about Fable 5 shows this very clearly and my experience of it was the same: I even had Russian in one of the output documents created by Fable, with zero reason for it to be there.

Heatmaps don't tell the whole story, you have to look at geometry and that is ...very difficult. Once you understand the attractor landscape, activation patterns and how attention works internally a lot of things fall out of it. Mechanistic interpretability is a bitch, but it's fun.

You can get lots of interesting things happening if you hook into layers and interleave, for something I built I was able to 'train' some basic math with a totally scrambled corpus and it still worked.

Anthropic's papers are pretty good, also look at Goodfire who Anthropic is unsurprisingly heavily invested in.

One criticism I have of the linked articles (not your post, the source – I read some of his posts including the one from this year on the same topic) is that I don't see controls being done. When you benchmark specific things you're targeting improvement at you have to make sure you aren't causing degeneracy in other areas and that requires a very wide sampling, you should understand to the logit level what the impact is of the interventions being done when working with nonstandard techniques.
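
To make "understand to the logit level" concrete, here is one minimal sketch of a control check: compare next-token distributions before and after an intervention on prompts you were not targeting, and flag the ones that drift. The KL threshold and the example logits are made up for illustration; in practice the logits would come from running the model on a wide control set.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def kl_divergence(base_logits, new_logits):
    """KL(base || new) in nats between the two next-token distributions."""
    lp = log_softmax(base_logits)
    lq = log_softmax(new_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def drifted(control_set, threshold=0.05):
    """Names of control prompts whose distribution moved by more than threshold."""
    return [name for name, (base, new) in control_set.items()
            if kl_divergence(base, new) > threshold]


# Hypothetical logits for two control prompts: (baseline, after intervention).
control_set = {
    "math": ([2.0, 0.5, -1.0], [2.1, 0.4, -1.0]),
    "translation": ([1.0, 1.0, 0.0], [3.0, -1.0, 0.0]),
}
flagged = drifted(control_set)
print(flagged)
```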

You can PM me if you have any specific questions since I don't want to clog the thread up, but I thought this might be worthwhile for a general audience.

I am really dying for that M5 Ultra Studio.

edit: also, look at how Gemma 4 works with alternating layers, a sliding window, and a final summation layer - when you design from scratch with some of those techniques in mind you can get a really powerful result. Gemma 4 can solve some HLE problems for example (at a token cost of ~2k-5k), but it's free on device albeit not exceptionally fast.
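
For anyone who hasn't seen the sliding-window idea, here is a generic sketch of the attention masks involved. It is not Gemma 4's actual configuration: the window size and the "every third layer is global" schedule are placeholders I picked for illustration.

```python
def causal_mask(n):
    """Global causal attention: token i may attend to every token up to i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Local attention: token i may attend to itself and the window - 1 tokens before it."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def layer_schedule(n_layers, global_every):
    """Hypothetical alternation: every global_every-th layer is global, the rest local."""
    return ["global" if (k + 1) % global_every == 0 else "local"
            for k in range(n_layers)]


schedule = layer_schedule(n_layers=6, global_every=3)
local = sliding_window_mask(n=5, window=2)
for row in local:
    print("".join("x" if allowed else "." for allowed in row))
```

Local layers keep per-token attention cost bounded by the window, while the occasional global layer lets information travel across the whole context.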
 
I'm less concerned about controls and suchlike than you are 🙂 When virgin territory opens up, there's room for both wild explorers who simply go where they want and write down what they see, AND surveyors carefully mapping in great detail.

Thanks for the Gemma 4 ref. Now I have a whole lot of new reading ahead of me.
 
Everything currently available on the consumer market is very underpowered. Nvidia is set to launch the first dedicated AI accelerators this year. This will be the Nvidia Spark for the consumer market. The maximum AI computing power will be 1 Petaflop...
Everything else on the consumer market is, for now, just a gimmick. Note also that Nvidia and AMD are putting new graphics card launches on hold for two years. Apple is also standing still in this regard, unable to exceed the performance of the RX 6900XT... But in two or perhaps three years’ time, we’ll see a new generation of AI-capable graphics cards at a price accessible to almost everyone 😃
 