I am not good with hardware. Would this allow an M2 Mac Pro to use a GPU for inference through its PCIe slots?

I can imagine it would work in principle (that's an interesting question — do PCI drivers on Apple Silicon seamlessly work with both Thunderbolt and internal PCIe, or is additional driver configuration needed?).

However, in practice you will be limited by the power connectors and the available PCIe lanes. The Mac Pro provides only 300W of auxiliary power, and each 8-pin connector delivers only 150W, so at most you'd be able to fit an RTX 5060 without additional adapters. With adapters, an RTX 5070 might be possible. And the M2 Ultra offers only 24 lanes of PCIe 4.0, which is not enough to saturate the x16 PCIe connection of a modern Nvidia GPU. Since we are talking about machine learning applications, that will likely be noticeable.
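To put rough numbers on both constraints, here is a back-of-the-envelope sketch; the per-lane rate is the PCIe 4.0 spec figure, and the wattages are the ones cited above:

```python
# Back-of-the-envelope check of the bandwidth and power constraints.
# PCIe 4.0 moves ~1.97 GB/s per lane per direction (16 GT/s, 128b/130b encoding).
per_lane_gbs = 1.97
print(f"x16 slot ceiling:       {16 * per_lane_gbs:.1f} GB/s")
print(f"M2 Ultra, all 24 lanes: {24 * per_lane_gbs:.1f} GB/s, shared across slots")

# Power: 75W from the slot itself plus 150W per 8-pin aux connector,
# against a 300W total auxiliary budget.
slot_w, eight_pin_w = 75, 150
print(f"slot + one 8-pin:  {slot_w + eight_pin_w} W")      # RTX 5060 territory
print(f"slot + two 8-pins: {slot_w + 2 * eight_pin_w} W")  # ~RTX 5070, with adapters
```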
 
Was there ever a ban, or is that a poetic addition by the author? Specifically, did anyone ever try to get a GPU driver signed by Apple and get refused?

IMO, all of this reads like a big bunch of nothing.
Like I said, I don't know the inside baseball on this.

My puzzlement with this is more that, if you want local access to an NVIDIA GPU—for whatever reason—it seems it makes more sense to just build a PC with an NVIDIA GPU rather than attaching an eGPU to your Mac.

As an example, according to this video, the build cost of a TB5 RTX5080 eGPU is $2,000, compared with $2,630 for an RTX5080 PC. Not much difference.


So I think it only makes sense if you already happen to have a spare NVIDIA GPU on hand. That's why I was focusing more on its potential implications for future cooperation between Apple and NVIDIA than on its functionality, as the latter seems lacking.
 
I am not good with hardware. Would this allow an M2 Mac Pro to use a GPU for inference through its PCIe slots?

It 'should', but there might be a quirk in it. (I suspect they are just not spending money on a Mac Pro to certify it.)

DriverKit has a general class for PCI drivers.


On the same page is a link to the Thunderbolt driver which is just a narrow superset.


All the 'qualification' Thunderbolt driver class is doing is adding an attribute saying the driver is 'pause'/hot-plug capable. Also, if it is a data device, it will require an 'eject' before removal; the 'eject' part doesn't matter for a GPU. There is really no property of the Thunderbolt data transport at issue here at all. It is just that folks can hot-plug/unplug at any moment. Technically, the PCIe standard optionally allows hot-plug cards too. [The 2019/2023 towers nominally make it somewhat hard to have the system on and yank a card out at the same time. Most PCIe drivers don't do hot-plug because it is not strictly required; it usually comes up in the context of PCIe cards for 24/7 mainframes, where someone pulls a card out of a running machine and it just automatically adjusts and moves forward. Those attributes are in the PCIe standard too. So technically, it is not a Thunderbolt-only issue. It is an 'Apple doesn't do big iron' issue rather than something to blame on Thunderbolt.]

If the driver says "I handle the AMD GPU with ID 123", then maybe an extremely restrictive DriverKit matcher would skip it when Tinygrad says "I can talk to ID 123". But for an internal PCIe slot, the hot-plug attribute really doesn't matter; it is an almost entirely safe attribute to ignore in that context. The main job of the driver-matching process is to pair the PCIe device IDs a driver claims against those discovered on the sweep of the whole system's PCIe bus, where devices announce "I'm here". Thunderbolt doesn't really play a direct role in what is on or off that list (when everything is hooked up and static). The matching should ignore the attribute when not in a hot-plug context. (The tinygrad driver having the attribute is immaterial.)
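A toy sketch of that matching logic in Python (not Apple's actual implementation; the names and IDs are made up for illustration):

```python
def match_drivers(drivers, devices, hot_plug_context=False):
    """Pair PCI device IDs found on the bus sweep with drivers claiming them."""
    for dev_id in devices:
        for drv in drivers:
            if dev_id not in drv["pci_ids"]:
                continue
            # Only a hot-plug (e.g. Thunderbolt) context needs this attribute;
            # for an internal slot it is safe to ignore.
            if hot_plug_context and not drv.get("hot_plug_capable", False):
                continue
            yield f"{dev_id:#x}", drv["name"]

drivers = [{"name": "tinygrad-amd", "pci_ids": {0x123}, "hot_plug_capable": True}]
print(list(match_drivers(drivers, devices=[0x123])))  # [('0x123', 'tinygrad-amd')]
```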

The core Thunderbolt transport is really no issue here at all. The DriverKit class is for "PCI" devices, and this is a PCI device.

If it doesn't work, I suspect it is more Apple's fault (an overly strict device matcher) than Tinygrad's.
 
Like I said, I don't know the inside baseball on this.

My puzzlement with this is more that, if you want local access to an NVIDIA GPU—for whatever reason—it seems it makes more sense to just build a PC with an NVIDIA GPU rather than attaching an eGPU to your Mac.

As an example, according to this video, the build cost of a TB5 RTX5080 eGPU is $2,000, compared with $2,630 for an RTX5080 PC. Not much difference.

Yep, pretty much.

That's why I was focusing more on its potential implications for future cooperation between Apple and NVIDIA than on its functionality, as the latter seems lacking.

I find it doubtful that there is any interest in such a cooperation from either of the companies. Not much money to be made here either. Apple will continue refining their GPU matrix accelerators and we'll have performance parity with consumer Nvidia stuff before long anyway.
 
Sorry, not following how this addresses my question of whether TB5 is supported.

If you put an Nvidia PCIe v5 card in a PCIe v3 slot in an older PC, does it stop working? Probably not.

Thunderbolt v5 doesn't matter. Or really should not matter. TB v1, 2, 3, and 4 all have PCIe data tunneling. Having a tunnel is the critical part of the issue; sending data back and forth is what matters. How PCIe is transformed into Thunderbolt-native data packets and transported is completely transparent to user-land software (and transparent to the vast majority of the OS kernel too). The TB controller's job is to take PCIe data and drop it off on the other side without folks on either end of the tunnel having to do anything special about the data.

The USB/TB controllers on either side of the cable (and the cable quality) are what negotiate the 'speed limit' setting. It isn't the DriverKit driver.

The only place where some aspect of Thunderbolt breaks through the abstraction layer is when the cord is yanked out of the Thunderbolt jack 'hot'. The hot plug-and-play aspect disrupts most casually implemented PCIe drivers because they leave out the optional hot-plug support that is in the PCIe standard. Most PC systems can't do it right, so nobody implements it. But there is really nothing there that isn't already in the optional section of the PCIe standard.

There are not "Thunderbolt" class drivers in DriverKIt. There are just PCI Drivers with attributes saying they can handle hot-plug contexts properly. Otherwise it is just a PCI-e device.


I highly doubt this is some very large-budget development effort trying to buy all of the latest, greatest hardware to put the PCIe cards into. The GPU cards alone are expensive enough. It is more like a hobby project putting in incremental effort to get the DriverKit work formally done.
 
That is indeed quite puzzling. It's like saying "hey, there is now a towing cable for a Prius, which is a game-changer for people who need to haul stones from a quarry!"
Tiny Corp produces a van (tinybox red, $12,000, 2x 1GbE + OCP3.0 PCIe4) and a small truck (tinybox green, $65,000, 2x 10GbE + OCP3.0 PCIe5):


Do we know if TinyGPU is designed to work with these? Seems unlikely, right? Apple would need to provide an OCP3.0 interconnect for that. Those networking specs remind me of the Apple servers seen in Houston, with only two I/O connectors (two pairs per rack): one GbE Ethernet, the other probably Small Form Factor OCP3.0.

They also produce a literal shipping container (exabox, ~$10M, 3.2 TB/s networking). Pre-orders are fully refundable and only require you to put down $100K!
 
Apple will continue refining their GPU matrix accelerators and we'll have performance parity with consumer Nvidia stuff before long anyway.
You and I discussed the Apple vs. NVIDIA GPU performance disparity a couple of years ago and, IIRC, the conclusion was that (while Apple had some interesting GPU architectural innovations), taken as a whole, NVIDIA's consumer GPU architecture was at least as advanced as Apple's, and in some ways more so.

Thus, at the time, the only way for Apple to achieve performance parity would have been to offer NVIDIA-like core counts and max clock speeds. Apple was unwilling to do this, since that would have resulted in NVIDIA-like GPU wattages.

For Apple to, in the future, achieve performance parity with, say, future RTX XX80 and XX90 desktops while keeping roughly within the current GPU TDPs of the Max and Ultra would require that they leapfrog well beyond NVIDIA's future GPU architecture. And given this is NVIDIA's wheelhouse, it's hard to understand how they could do that.
 
You and I discussed the Apple vs. NVIDIA GPU performance disparity a couple of years ago and, IIRC, the conclusion was that (while Apple had some interesting GPU architectural innovations), taken as a whole, NVIDIA's consumer GPU architecture was at least as advanced as Apple's, and in some ways more so.

Thus, at the time, the only way for Apple to achieve performance parity would have been to offer NVIDIA-like core counts and max clock speeds. Apple was unwilling to do this, since that would have resulted in NVIDIA-like GPU wattages.

For Apple to, in the future, achieve performance parity with, say, future RTX XX80 and XX90 desktops while keeping roughly within the current GPU TDPs of the Max and Ultra would require that they leapfrog well beyond NVIDIA's future GPU architecture. And given this is NVIDIA's wheelhouse, it's hard to understand how they could do that.

A lot has happened since then. Apple has superscalar GPU pipelines, matrix accelerators, and a flexible memory hierarchy, and their ALU utilization is best in class. In the Blender benchmark suite, the M5 Max is on par with the 5090 mobile, although the latter should be 2x faster according to the spec sheet. For half-precision matrix compute, the M5 Max should be equivalent to a desktop 5070.

I'm not suggesting that Apple mobile will surpass Nvidia enthusiast desktop; that would be a tough nut to crack. At the same time, what most users care about is usable performance, not merely absolute performance. We have multiple reports of the M5 Max achieving token generation speeds on popular LLMs suitable for coding agent usage. Sure, a 5090 might still be two or even three times faster, but that alone won't increase your productivity, and many folks prefer the flexibility of a laptop. So again, it depends on what you are after. Apple doesn't need to be faster than Nvidia to achieve success in the ML compute market; they just need to be fast enough to be usable and convenient.
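For a sense of why "fast enough" can beat "fastest", here is the standard back-of-the-envelope for token generation, which is mostly memory-bandwidth-bound. Every number below is an assumption for illustration, not a measured spec:

```python
# Decode streams roughly the whole model through memory once per token,
# so tokens/s ~ memory bandwidth / model footprint.
def tokens_per_sec(mem_bw_gbs: float, model_gb: float) -> float:
    return mem_bw_gbs / model_gb

model_gb = 20  # a mid-size model at 4-bit quantization (assumed)
for name, bw_gbs in [("M5 Max-ish unified memory", 550),
                     ("RTX 5090-ish GDDR7", 1790)]:
    print(f"{name}: ~{tokens_per_sec(bw_gbs, model_gb):.0f} tok/s")
# ~28 vs ~90 tok/s: a ~3x gap, but both are comfortably usable for an agent.
```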
 
A lot has happened since then. Apple has superscalar GPU pipelines, matrix accelerators, and a flexible memory hierarchy, and their ALU utilization is best in class. In the Blender benchmark suite, the M5 Max is on par with the 5090 mobile, although the latter should be 2x faster according to the spec sheet. For half-precision matrix compute, the M5 Max should be equivalent to a desktop 5070.

I'm not suggesting that Apple mobile will surpass Nvidia enthusiast desktop; that would be a tough nut to crack. At the same time, what most users care about is usable performance, not merely absolute performance. We have multiple reports of the M5 Max achieving token generation speeds on popular LLMs suitable for coding agent usage. Sure, a 5090 might still be two or even three times faster, but that alone won't increase your productivity, and many folks prefer the flexibility of a laptop. So again, it depends on what you are after. Apple doesn't need to be faster than Nvidia to achieve success in the ML compute market; they just need to be fast enough to be usable and convenient.
Interesting to hear, thanks!

But note that AS performance relative to NVIDIA mobile GPUs is not that germane to this thread, which is about people getting NVIDIA eGPUs working with their Macs. And pretty much anyone who bothers to do that is going to be putting a desktop GPU in that eGPU box.

So when you said "parity", I assumed you meant parity with NVIDIA desktop devices--which you've explained won't (or is unlikely to) be achieved in the near future.
 
But note that AS performance relative to NVIDIA mobile GPUs is not that germane to this thread, which is about people getting NVIDIA eGPUs working with their Macs. And pretty much anyone who bothers to do that is going to be putting a desktop GPU in that eGPU box.

The challenge here is really the interconnect bottleneck. If the goal is machine learning applications that surpass the abilities of your Mac, then you also probably need plenty of memory bandwidth to move data around. Even with TB5, the GPU will be running at low efficiency. In fact, the weaker Mac GPU could provide a superior experience for many use cases, simply because it has direct access to the data.
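Rough numbers make the point; the bandwidth figures below are assumptions for illustration (TB5 tunnels roughly a PCIe 4.0 x4's worth of traffic):

```python
tb5_link_gbs    = 8     # ~PCIe 4.0 x4 tunneled over TB5 (assumed)
unified_mem_gbs = 500   # Apple unified memory bandwidth (assumed)

working_set_gb = 50     # data that does NOT fit in the eGPU's VRAM
print(f"stream over TB5:      {working_set_gb / tb5_link_gbs:.1f} s per pass")
print(f"local unified memory: {working_set_gb / unified_mem_gbs:.2f} s per pass")
# The eGPU's fast VRAM doesn't help if every pass re-crosses the slow link.
```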

So when you said "parity", I assumed you meant parity with NVIDIA desktop devices--which you've explained won't (or is unlikely to) be achieved in the near future.

I wouldn't be surprised if M5 Ultra already gets very close to 5090 for certain ML use cases. And who knows what Apple has planned for the future.
 
Tiny Corp produces a van (tinybox red, $12,000, 2x 1GbE + OCP3.0 PCIe4) and a small truck (tinybox green, $65,000, 2x 10GbE + OCP3.0 PCIe5):


Do we know if TinyGPU is designed to work with these? Seems unlikely, right? Apple would need to provide an OCP3.0 interconnect for that.

Apple provide an interconnect??? In this context, OCP is just a card form-factor standard; the difference is that it wasn't designed in the 1980s or early 1990s.
OCP has a variety of projects addressing standardized solutions for modern server problems. There is a server project (e.g., pluggable accelerators).

Networking (which includes network interface cards, NICs).

And a pretty broad set of projects.

[emphasis added. ]
"...Additionally, the machine features an empty 16x OCP 3.0 slot for networking, allowing for flexible connectivity options. ..."
Pragmatically, what the specs are saying is that there is no high-speed network card present in the default box.

What that 'OCP' is saying in the networking section of the specifications is that you are not buying an Ethernet card in the 1990s form factor at a generic consumer electronics store. It is not saying it cannot be an Ethernet card.

[Image: OCP NIC 3.0 W1 and W2 form factors]

[Image: Broadcom 57508 100Gb OCP adapter]

There are adapter packages to put some of these OCP Ethernet chips into a standard PCIe card format if you really want to, but it is not necessary.

The Ethernet jack that faces the outside of the Tinybox is the same standard jack a number of other servers use. There is nothing for Apple to 'adapt to'; anything can get to the LAN as long as you put a standard network card in the TinyBox's slot. Really nothing different from the rest of the AI server setups out there that take remote Ethernet clients.

Those networking specs remind me of the Apple servers seen in Houston, with only two I/O connectors (two pairs per rack): one GbE Ethernet, the other probably Small Form Factor OCP3.0.

Probably not. That other socket was a standard QSFP28 or SFP28 transceiver connector (i.e., the standard network connector you find in datacenter contexts). Those were not card slots. [At 50/100/200+ GbE network speeds, BaseT (generic twisted-pair wiring) is not the norm anymore.]

They also produce a literal shipping container (exabox, ~$10M, 3.2 TB/s networking). Pre-orders are fully refundable and only require you put down $100K!

And you get to that shipping container through standards-based Ethernet also... just like the rest of the Internet.


P.S. The TinyGrad framework already has code to 'ship' compute requests over a network socket to get work done. It isn't "Mac" specific. This work for the Mac is more about cheaper (more affordable) developer seats for folks to write code that eventually gets deployed to the tinybox hardware. Some folks will do single-seat stuff with no Tiny hardware, but that is just a tolerable side effect. Long term, if it doesn't lead to substantive tinybox hardware sales, then the Mac port aspect may drift away in terms of effort.
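For context, "writing code to the framework" looks like ordinary tinygrad, regardless of what ends up executing it. A minimal sketch (the device names are illustrative; what is actually available depends on your build and hardware):

```python
from tinygrad import Tensor

# Two matrices on whatever backend this machine exposes ("METAL" on a Mac GPU,
# or the AMD backend the new driver enables - the device string is the only change).
a = Tensor.rand(1024, 1024, device="METAL")
b = Tensor.rand(1024, 1024, device="METAL")
c = (a @ b).realize()  # tinygrad is lazy; realize() compiles and runs the kernels
print(c.shape)
```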
 
So when you said "parity", I assumed you meant parity with NVIDIA desktop devices--which you've explained won't (or is unlikely to) be achieved in the near future.

Pragmatically, in the ML compute space and versus general consumer Nvidia desktop GPUs, they do have parity.
The VRAM in consumer GPU cards is capped pretty low, and the AI models are on a 'sky is the limit' chase to use up more RAM. Once a 5090 runs out of VRAM, it is not going to be faster than an Apple solution that has 2x (or more) the amount of RAM workspace for the problem. Once the 'swapping virtual memory' effect kicks in (and especially over USB4 or USB4-gen2 limitations), that is a boat anchor on performance.

Nvidia's special additions for smaller and smaller 16-, 8-, 4-, and 2-bit sizes are a dead-end game. There are not many bits left to throw away to make the compute smaller and boost the operations-per-second numbers.
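The arithmetic behind that: each halving of the bit width halves the footprint (and roughly doubles peak ops/s), but the absolute savings shrink every step. Illustrative numbers for a 70B-parameter model:

```python
params = 70e9  # 70B parameters (illustrative)
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {params * bits / 8 / 1e9:.0f} GB")
# 140 -> 70 -> 35 -> 18 GB: each halving buys less, and below a couple of
# bits per weight there is nowhere left to go.
```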

If Apple doesn't want to be king of the hill in video game frame rates... that too is silicon area overhead they don't have to allocate that Nvidia does. (There are trade-offs.)

For a very narrow, cherry-picked subset of problems, Apple can close the gap over time, because Nvidia has to cover a broader set of solutions with a finite amount of silicon to address each of those problems. (E.g., Nvidia isn't beating Apple on ProRes processing.)
 
Nvidia's special additions for smaller and smaller 16-, 8-, 4-, and 2-bit sizes are a dead-end game. There are not many bits left to throw away to make the compute smaller and boost the operations-per-second numbers.

Great point! This is another area where Apple's strategy seems to differ significantly. While their hardware currently lags in performance, it offers more flexibility regarding data layouts and types. For instance, Apple can transpose matrices on the fly, and it also appears that the execution units support mixed-precision operation to a limited extent. In practice this means simpler shader code (and thus better occupancy), less data movement, more efficient use of the matrix units, and less time spent optimizing the code. I am very curious to see how this plays out over time, and I can imagine that this kind of flexibility can compensate for lower peak performance.
 
The VRAM in consumer GPU cards is capped pretty low, and the AI models are on a 'sky is the limit' chase to use up more RAM. Once a 5090 runs out of VRAM, it is not going to be faster than an Apple solution that has 2x (or more) the amount of RAM workspace for the problem. Once the 'swapping virtual memory' effect kicks in (and especially over USB4 or USB4-gen2 limitations), that is a boat anchor on performance.
Sure. But unless someone specifically wants or needs to run CUDA code, there's no reason to try to get an NVIDIA eGPU running on a Mac unless it significantly increases performance.

Hence I assume those trying it for performance reasons would be running the subset of compute tasks that don't need the high amounts of VRAM that Macs provide. I.e., for the enthusiasts that are seriously considering this, the VRAM disparity wouldn't be relevant since otherwise they wouldn't even be thinking about it in the first place.

Thus I think the main reason it doesn't make sense to do this is instead what I stated earlier: It's cleaner, more performant, and not much more costly, to just buy an NVIDIA PC:

... if you want local access to an NVIDIA GPU—for whatever reason—it seems it makes more sense to just build a PC with an NVIDIA GPU rather than attaching an eGPU to your Mac.

As an example, according to this video, the build cost of a TB5 RTX5080 eGPU is $2,000, compared with $2,630 for an RTX5080 PC. Not much difference.


So I think it only makes sense if you already happen to have a spare NVIDIA GPU on hand.
 
Sure. But unless someone specifically wants or needs to run CUDA code, there's no reason to try to get an NVIDIA eGPU running on a Mac unless it significantly increases performance.

No reason? If you already own a Mac and want to add an additional specialty agent to do some specific, sandboxed work that is isolated from your main resources, it is a straightforward path to adding another 'worker' to the same machine/system. If the host Mac is maxed out of isolatable compute, then adding more compute engines does add performance.

Anyone who is a rigid CUDA purist isn't going to use Tiny's framework to get to an Nvidia GPU, whether there is an eGPU involved or not. Not only are they not looking for eGPUs, they are not looking for multiplatform solutions at all.
However, some models are 'trapped' in the CUDA walled garden. If there is a specialist model that is trapped in CUDA, then a multiplatform framework is what you need if you are going to use a Mac as the main host.

Does Apple have very easy-to-use agent sandboxing controls? Not right now, but they have been trying to sandbox the standard macOS environment in some fundamental ways for a number of years, and they have a decent foundation to start from. (Folks running OpenClaw locally on a Mac as an admin user is likely a security-dubious move.) For example, hand an agent a clone/snapshot of a subset of files. The agent may go off and destroy the clone, but then you just kill that snapshot (all without having to literally double the amount of storage to make a copy).
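A sketch of that clone idea as it already works on APFS today. The `-c` flag asks `cp` to use clonefile(), so the "copy" shares storage blocks until the agent writes to them; the paths are made up:

```python
import subprocess, shutil

# Copy-on-write clone of the files the agent is allowed to touch (APFS only).
subprocess.run(["cp", "-c", "-R", "project", "agent_scratch"], check=True)

# ... let the sandboxed agent loose inside agent_scratch/ ...

shutil.rmtree("agent_scratch")  # discard; storage was never actually doubled
```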


Hence I assume those trying it for performance reasons would be running the subset of compute tasks that don't need the high amounts of VRAM that Macs provide. I.e., for the enthusiasts that are seriously considering this, the VRAM disparity wouldn't be relevant since otherwise they wouldn't even be thinking about it in the first place.

If you want to run multiple AI models, then resources run out quicker. It could be just an incremental amount of extra VRAM that you need.

Thus I think the main reason it doesn't make sense to do this is instead what I stated earlier: It's cleaner, more performant, and not much more costly, to just buy an NVIDIA PC:

If you want a maximally isolated, almost air-gapped environment, then yes. If the interim work needs to be shared or collaborated on, then it may not be as 'clean' as you are making it out to be.

"More costly" boils down to what you already have. For folks who already have a Mac and a PCIe enclosure, the incremental cost here is much lower than buying a Linux/Nvidia PC from scratch. I don't think this is primarily aimed at selling new boxes so much as letting folks leverage the systems they already have, with some incremental additions. It is more about targeting enthusiasts with the hardware they have, as opposed to the hardware Tiny wishes they had. Ideally, Tiny wishes everyone had Tiny hardware. 🙂

This framework's main purpose is to get more Tiny hardware bought long term. They need more software written to their framework to make that grow faster. So getting the framework into the hands of folks who might develop more solutions that need higher compute scale builds the business. Lots and lots of developers have MacBooks. It is a move to meet those folks where they are (and where they have 'sunk costs' already).


tinygrad isn't only ported to Macs. It is one of multiple paths to getting more solutions built for Tiny's systems. For a long time Apple has had the "developer who needs Unix and a somewhat mainstream GUI/apps" platform. There are a number of folks who deal with getting things up on one system and deploying them on another. Apple approving drivers that help with that really shouldn't be causing the 'surprise' that is spinning around this approval. Apple risks handing some of this audience off as Windows grows more adoptive of its Linux subsystem and Apple gets more lackadaisical about its Unix roots and staying attached to things evolving there (e.g., 'discovering' RDMA about 7-10 years after most Linux/Unix platforms).
 
Thanks for clearing that up, and providing a link to the source of the "USB4/Thunderbolt" quote that deconstruct60 referenced!

A year ago, TB5 was not supported, according to the article cited by deconstruct60:

"Requirements for running an eGPU through a USB3 interface at this time include the use of an ASM2464PD-based adapter and an AMD GPU. For its tests, Tiny Corp used the ADT-UT3G adapter, which uses the same ASM2464PD chip, but out of the box, it only works with Thunderbolt 3, Thunderbolt 4, or USB 4 interfaces"
The ASM2464PD chip doesn't support TB5. (Confirmed by looking up its page on Asmedia's website.) That doesn't mean Tiny Corp's driver doesn't.

The Tiny Corp driver should exist at a layer in the driver stack where it shouldn't have to know or care whether the Thunderbolt link it's using to talk to the GPU is running at version 3, 4, or 5. It ought not to be directly interacting with the ASM2464PD at all; it should only be concerned with memory regions mapped to the GPU.

To expand on that last, even though PCIe is based on packets at the lowest level, it's fundamentally a memory mapped bus. That means if you're writing a device driver for a PCIe endpoint, you just read or write a "memory" address, using normal CPU instructions that would normally access RAM. Hardware responsible for routing memory accesses to where they need to go notices that the address lies inside a PCIe range, and forwards the access to the responsible PCIe bridge. The PCIe bridge automatically creates and transmits the appropriate PCIe memory request packet, deals with any response packet, and returns read data (if the access is a read). If TB is involved, those PCIe packets may get tunneled through Thunderbolt bridges and links.

None of the hardware responsible for routing PCIe and TB packets around needs software (driver) assistance beyond one-time setup when each device is configured. All the setup/config interfaces are constrained by the relevant specification documents to be the same for all devices, so pure bus infrastructure chips like the 2464 TB-PCIe bridge are handled by generic universal drivers (which, in Apple's operating system, are naturally provided by Apple).

So, the only thing Tiny Corp's driver should be doing is asking Apple's driver stack to return a list of all PCIe devices with the PCIe vendor and device ID of the GPU it wants to use, then asking to receive memory mappings for one or more of them. After that's done, the driver doesn't need to make API calls into Apple's stack anymore; it just pokes around inside GPU memory. The speed of the several layers of hardware responsible for transparently bridging CPU loads and stores to PCIe memory request and response packets is very much Somebody Else's Problem.
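To make that flow concrete, here is the whole thing in miniature. This is a hedged concept demo using Linux sysfs, because that is easy to show in a few lines (and needs root); on macOS the same two steps happen through DriverKit matching and memory mapping instead, and the IDs below are examples only:

```python
import glob, mmap, os, struct

VENDOR, DEVICE = "0x1002", "0x744c"  # example AMD vendor/device IDs

for dev in glob.glob("/sys/bus/pci/devices/*"):
    with open(os.path.join(dev, "vendor")) as f:
        vendor = f.read().strip()
    with open(os.path.join(dev, "device")) as f:
        device = f.read().strip()
    if (vendor, device) != (VENDOR, DEVICE):
        continue  # step 1: enumerate, keep only the GPU we want

    # Step 2: map BAR0. From here on, register access is plain loads and
    # stores that the PCIe (and any Thunderbolt) bridges route transparently.
    with open(os.path.join(dev, "resource0"), "r+b") as f:
        bar0 = mmap.mmap(f.fileno(), 4096)
        reg = struct.unpack_from("<I", bar0, 0)[0]  # read one 32-bit register
        print(f"{dev}: BAR0[0x0] = {reg:#010x}")
        bar0.close()
```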
 
Clearly, the focus is on AI/LLMs. However, I'm wondering (and haven't been able to extrapolate/interpret) whether the eGPU driver is capable of supporting other CUDA compute applications (any, or at least most), such as multi-GPU 3D rendering.

The real question is if we can stuff 4 Nvidia cards in 2019 MP.
If you have that much money, I don't know why you're thinking about limitations. 😉 😂

On a serious note… From the hardware aspect, without any modding/Frankenstein-ing, you'd be limited to dual-slot GPUs with a TBP* of 250W or (probably) less. More specifically:

RTX Pro 4000 Blackwell or lesser
RTX 5000 Ada or lesser
RTX A5000 or lesser
GeForce RTX 5070 or lesser
GeForce RTX 4070 (SUPER) or lesser
GeForce RTX 2080 or lesser

* The 2019 Mac Pro has a 1400W PSU, rated at 1280W at 108–125V/220–240V or 1180W at 100–107V.
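A quick power-budget sanity check for the four-card idea, using the figures above plus an assumed allowance for the rest of the system:

```python
psu_usable_w  = 1280  # Apple's rating at 108-125V / 220-240V
base_system_w = 300   # assumed: CPU, RAM, storage, fans
per_card_w    = 250   # the dual-slot TBP ceiling noted above

for n in range(1, 5):
    total = base_system_w + n * per_card_w
    print(f"{n} card(s): {total:>4} W -> {'ok' if total <= psu_usable_w else 'over budget'}")
# Four 250W cards land right past the limit - hence the undervolting /
# power-capping plan in the reply below.
```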
 
Yeah, I'm thinking undervolting / power capping (or whatever the mechanism is) of 4x RTX 3090 Turbos in a 2019 chassis will be useful for a whole bunch of stuff. I'm not worried about the 15% performance loss or whatever. I can go outside and walk on grass while I wait for an answer.
 