It could be 'off die', but it would still be inside Apple's package. Amazon's Graviton3 doesn't put the PCI-e or memory controllers on the "compute" die.
https://www.nextplatform.com/2022/01/04/inside-amazons-graviton3-arm-server-processor/
See also: "Amazon AWS Graviton3 C7g instances are now in GA and are built upon a 3 CPU per motherboard design which is very unique" (www.servethehome.com)
AMD did a different disaggregation: they moved the memory controllers and L3 cache off and left the PCIe and DisplayPort output on the main die.
IMHO, Apple does need a better chiplet design strategy for the desktop (minimally the Studio, the Mac Pro, and perhaps a large-screen, performance-oriented iMac) that the laptops probably won't want to use: a 'desktop' Max, an Ultra (which will only ever be a desktop part), and a >2-compute-chiplet "Extreme".
Does Apple have a chiplet design strategy at all?!
Replicating past 6 Thunderbolt controllers gets into the certifiably silly zone; past 4 is dubious. The Mac Studio putting 1 or 2 ports on the front is useful if you do a decent amount of plug/unplug activity (instead of reaching around the back), so there is a small "get out of silly jail" card there. Multiple secure elements and SSD controllers become relatively silly once you get past two 'chiplets'. And the Max die is rather too chunky to make a good 'chiplet'.
Also, it doesn't make tons of sense to put the general I/O functionality on TSMC N3 (and better), because the off-package external communication lanes are not going to scale well with the process; they would just cost more for no good reason. Sooner or later Apple will probably decouple a subsection with PCI-e in it from the compute cores. I doubt Apple will decouple the CPU/NPU/GPU cores from one another, though (i.e. start making CPU-less or GPU-less packages). Very long term that is probably coming. Is it coming for the M2 generation? Not sure.
Conceptually, Apple could buy a baseline PCI-e controller design 'off-the-shelf', but with CXL it needs to be integrated with whatever the internal cache coherency implementation is. Unless there is some huge security-model and coherency-model mismatch between Apple's internals and the standard PCI-e + CXL external model, that shouldn't be ridiculously expensive.
It is lane-bundle breadth, more than moving along the basic PCI-e vN upgrade train, that I think will be the bigger problem for Apple. I suspect they'd rather feed just one x16 PCI-e v5 bundle to a PLX-style PCI-e switch to dole out 8 PCI-e v3 lanes' worth of slots than do two x16 PCI-e v4 bundles (and shrinking back from the Intel W-3200's 64 PCI-e v3 lanes). Chasing wider aggregate LPDDRx memory lanes is just a higher priority. Move the PCI-e data off the package and then "expand" it by branching out.
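As a rough back-of-the-envelope check, a single fat uplink really does carry about the same aggregate bandwidth as a pile of older lanes, which is what makes the switch fan-out approach workable. A minimal sketch in Swift (the per-lane throughput numbers are my own approximations from the published GT/s rates and 128b/130b encoding, nothing Apple-specific):

```swift
import Foundation

// Approximate per-lane, per-direction throughput in GB/s:
// GT/s * (128/130 encoding efficiency) / 8 bits per byte.
let perLaneGBps: [String: Double] = [
    "PCIe v3": 8.0  * 128.0 / 130.0 / 8.0,   // ~0.985 GB/s
    "PCIe v4": 16.0 * 128.0 / 130.0 / 8.0,   // ~1.97 GB/s
    "PCIe v5": 32.0 * 128.0 / 130.0 / 8.0,   // ~3.94 GB/s
]

func bandwidth(_ gen: String, lanes: Int) -> Double {
    (perLaneGBps[gen] ?? 0) * Double(lanes)
}

let gen5Uplink = bandwidth("PCIe v5", lanes: 16)   // one x16 bundle off the package
let w3200Style = bandwidth("PCIe v3", lanes: 64)   // Intel W-3200 style lane count
print(String(format: "x16 PCIe v5 uplink: %.0f GB/s", gen5Uplink))
print(String(format: "64x PCIe v3 lanes:  %.0f GB/s", w3200Style))
// Both land around 63 GB/s per direction, so a downstream switch can branch
// one x16 Gen5 bundle out into many slower slots without giving up much
// aggregate bandwidth (ignoring switch latency and oversubscription when
// several slots burst at once).
```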
Faster Wi-Fi modules are coming too (Wi-Fi 7 and whatever else follows), so faster point connections will trickle down to the rest of the Mac lineup as well.
The 'cost' problem is more that they are not going to sell relatively many "Mac Pro only" PCI-e controllers if they try to chase the upper bleeding edge going forward (chasing PCI-e v6 in the same 18-month window as server SoCs from Amazon, Intel, and AMD will). They are just not going to make as many as those other players.
This is eye opening.
“Chiplets” seems like a very smart way of being able to improve “blocks” of an SoC in line with the most recent technology advancements, without waiting for a years-long complete redesign of the entire die.
I agree that Apple shouldn’t decouple the CPU, NPU, ML, GPU cores from the die unless — UNLESS — the performance and clock speeds of, say, GPU cores are being held back by the much slower clock speeds that CPU cores can only run at.
If Apple could design a GPU-only IC that could run at 10 GHz — and the required I/O memory bus/data bus is sufficiently fast, I say, Go for it!
Especially if more and more general-purpose work can be performed on the GPU instead of the CPU. (Apple needs to strive much harder to find more GPGPU optimizations; the Linux ecosystem is way ahead of Apple in this pursuit.)
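On Apple platforms that general-purpose GPU work mostly goes through Metal compute today. A minimal sketch of what that looks like (the kernel source, names, and sizes here are my own toy choices, not tied to any particular Apple GPU):

```swift
import Metal

// Hedged sketch: dispatch a general-purpose computation (vector add) to the GPU.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void add_arrays(device const float *a  [[buffer(0)]],
                       device const float *b  [[buffer(1)]],
                       device float *out      [[buffer(2)]],
                       uint i [[thread_position_in_grid]]) {
    out[i] = a[i] + b[i];
}
"""

guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue() else {
    fatalError("No Metal-capable GPU available")
}

let library  = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "add_arrays")!)

let n = 1 << 20
let a = [Float](repeating: 1.0, count: n)
let b = [Float](repeating: 2.0, count: n)
let byteCount = n * MemoryLayout<Float>.stride
let bufA   = device.makeBuffer(bytes: a, length: byteCount, options: [])!
let bufB   = device.makeBuffer(bytes: b, length: byteCount, options: [])!
let bufOut = device.makeBuffer(length: byteCount, options: [])!

let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufA,   offset: 0, index: 0)
encoder.setBuffer(bufB,   offset: 0, index: 1)
encoder.setBuffer(bufOut, offset: 0, index: 2)
// One thread per element; the driver splits the grid into threadgroups.
encoder.dispatchThreads(MTLSize(width: n, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth,
                                                       height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let results = bufOut.contents().bindMemory(to: Float.self, capacity: n)
print("first element:", results[0])   // 3.0
```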
But I suspect Apple doesn’t want a situation where an Apple Silicon SoC design changes every six months, much to the confusion/frustration of developers and even Apple’s own OS/SDK engineering teams. It’s already the case that two years after the M1’s release, macOS software that claims it’s optimized to run on the M1 isn’t really as optimized to run on the M1 as it could be. (Including even Apple’s own core apps like Final Cut Pro.)
Frequent changes to Apple Silicon via regular “chiplet” improvements might present an ever ”moving target” that developers will be demotivated to code specifically for, knowing that a fundamental part of the architecture might change in 6 months.
That’s why the burden on Apple’s own OS software engineers should be so high. If hardware abstraction is strictly adhered to, Apple’s OS (and SDK) software engineers can make changes to the underlying OS, in line with changes to the underlying silicon, such that existing codebases simply inherit silicon and OS performance improvements automatically, without the need for developers to so much as recompile existing apps.
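That is roughly what the dynamically linked system frameworks already buy you. A trivial sketch of the idea (just the public Accelerate API, nothing beyond that is assumed): an app that routes its math through vDSP picks up whatever per-chip tuning the OS ships in libvDSP, with no rebuild of the app binary.

```swift
import Accelerate

// The heavy lifting lives in the OS-supplied Accelerate framework, which Apple
// can retune for each silicon generation; the app keeps making the same call.
let a: [Float] = (0..<10_000).map { Float($0) }
let b: [Float] = Array(a.reversed())

var dot: Float = 0
vDSP_dotpr(a, 1, b, 1, &dot, vDSP_Length(a.count))   // vectorized dot product
print("dot =", dot)
```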
I do realize that this is how it works already; I’m calling for even greater dedication on Apple’s part. Rosetta 2, for example, was Apple’s “Moon Shot” that Microsoft can only DREAM of ever accomplishing.
(As I understand it, Apple Rosetta 2 engineers even found plenty of Intel instructions that needed no translation at all to run on ARM.)
Rosetta 2 might have led to a lot of Apple engineer burnout, but Apple needs to find a way to motivate engineers to be as devoted and dedicated to “impossible” feats like this again.