(I'm not a chip designer.) I wonder if the M3 Ultra could have the ultra connection on both sides allowing a 3+ configuration.
Three equally sized, 'Max-like', relatively large dies ( ~400mm^2 )? Probably not.
The TSMC packaging technology that Apple uses (InFO-LSI) caps out at about the reticle limit ( ~800mm^2 ). 3x 400 is likely too big. [ There is another packaging technology at TSMC, CoWoS, but all of that capacity is gone. The AI mania has consumed it all. I'm skeptical that Apple would get into an escalating price bidding 'war' to try to wrestle capacity away from others who have even fatter profit margins to throw at it. I'm even more skeptical that Apple is trying to make the most expensive solution possible ( the RAM/SSD capacity charges are already quite high ). ]
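A back-of-the-envelope sketch of that area check, using only the ballpark figures above. The function name is made up for illustration, and it assumes die area simply adds up, with no allowance for spacing, bridges, or extra dies:

```python
# Ballpark figures from the post; real die sizes and packaging limits differ.
MAX_LIKE_DIE_MM2 = 400    # approximate area of one Max-like die
PACKAGE_LIMIT_MM2 = 800   # approximate packaging cap quoted above

def fits_package(die_count, die_mm2=MAX_LIKE_DIE_MM2, limit_mm2=PACKAGE_LIMIT_MM2):
    """Return (total silicon area in mm^2, True if it is within the cap).

    Simplification (assumption): ignores die spacing, bridge area, and I/O dies.
    """
    total = die_count * die_mm2
    return total, total <= limit_mm2

if __name__ == "__main__":
    for n in (1, 2, 3):
        total, ok = fits_package(n)
        verdict = "fits" if ok else "over the cap"
        print(f"{n} x {MAX_LIKE_DIE_MM2} mm^2 = ~{total} mm^2 -> {verdict}")
```

Two dies land right at the cap; a third equally sized die overshoots it by roughly a whole die.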
Apple could do something like decompose the I/O off the die and make the rest more compute focused. Really do chiplets and decompose into different sizes. The 'problem' with the M1-M3 Max is that the 'top' is full of I/O functionality, which pragmatically needs to be on the edge of the die. If you substitute in UltraFusion there, where does the I/O go?
[ I/O die ] -- PCI-e , Thunderbolt , USB , possibly display controllers , etc. ( UltraFusion on only one side )
[ Compute die ] -- P core clusters , E core clusters , NPU , GPU , AV fixed function and Memory controllers (and system cache). ( UltraFusion two sides ; 'top' / 'bottom' )
The Compute die (CD) would be incrementally smaller than a Max ( minimally by pruning off the I/O ). Apple could also prune off some GPU clusters if it wanted something even smaller (e.g., old M1 Pro vs old M1 Max). The I/O die would be a lot smaller and doesn't necessarily have to be on a bleeding edge fab process ( but to minimize the gap with the laptop version of the Max, it could stay on the same one ).
Then Apple could do something like:
[ I/O die ]
[ CD ] == A 'Desktop' Max-like. Consistent 6 Thunderbolt ports on all Mac Studios.
[ I/O die ]

[ I/O die ]
[ CD ] == A 'Desktop' Ultra.
[ CD ]
[ I/O die ]

[ I/O die ]
[ CD 1 ]
[ CD 2 ] == A 'Desktop' More-than-Ultra ( but likely pushing the limits of InFO-LSI )
[ CD 3 ]
[ I/O die ]
The distance between the middle Compute die and the I/O dies is longer ( more latency ), but I/O going off the package has higher latency anyway ( it is already much slower than the compute-to-compute traffic ). Also, the number of I/O dies doesn't scale past two, so the I/O bandwidth pressure on the intra-chip data network isn't changing between configurations.
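A toy way to see that distance effect, assuming the package is a straight stack with one I/O die at each end and counting one 'hop' per die-to-die crossing. The hop is a made-up unit for illustration, not a real latency number:

```python
def io_hops(compute_dies):
    """For each compute die, return (hops to nearest I/O die, hops to farthest).

    Assumed layout: [I/O] [CD 1] ... [CD n] [I/O], so the I/O dies sit at
    positions 0 and n + 1 and CD k sits at position k.
    """
    n = compute_dies
    result = []
    for k in range(1, n + 1):
        to_top = k              # crossings up to the I/O die at position 0
        to_bottom = n + 1 - k   # crossings down to the I/O die at position n + 1
        result.append((min(to_top, to_bottom), max(to_top, to_bottom)))
    return result

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} compute die(s): {io_hops(n)}")
```

In the three-die stack the middle die is two hops from either I/O die, while the outer dies are one hop from the near I/O die and three from the far one.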
Apple will probably lose some traction with the gap between CD 1 and CD 2. If the CDs are 'as big as possible', even more so. If they're a bit smaller than a Max, then a bit less so.
[ Apple could manage this by only turning on the GPU clusters in adjoining pairs (e.g., only CD 1-2 or CD 2-3) when the workload isn't excessively 'embarrassingly parallel'. ]
But adding another compute die to the 'layer cake' would likely start to introduce latency problems.
[ CD 1 ]
[ CD 2 ]
[ CD 3 ]
[ CD 4 ]
CD 1 and 4 are relatively much farther apart (e.g., their respective pools of memory are farther away, so getting data takes longer and there is more traffic clogging the intra-chip 'data highway'). It also pushes the I/O dies further away from CD 2 and 3. Scaling in only one direction means things get farther and farther apart. It also gets more and more expensive to make with LSI interconnects.
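A rough sketch of why a one-direction stack scales badly, again treating the compute dies as a simple chain and counting die-to-die hops. The function names and the 'every pair of dies talks equally' traffic model are assumptions for illustration only:

```python
from itertools import combinations

def chain_stats(n):
    """Return (worst-case hops, mean hops) over all pairs of dies in a chain of n."""
    dists = [b - a for a, b in combinations(range(1, n + 1), 2)]
    return max(dists), sum(dists) / len(dists)

def link_load(n):
    """Number of die pairs whose traffic must cross each link in the chain.

    The link between CD i and CD i+1 separates i dies from n - i dies,
    so i * (n - i) pairs have to cross it.
    """
    return [i * (n - i) for i in range(1, n)]

if __name__ == "__main__":
    for n in (2, 3, 4):
        worst, mean = chain_stats(n)
        print(f"{n} CDs: worst {worst} hops, mean {mean:.2f} hops, "
              f"pairs per link {link_load(n)}")
```

Going from 3 to 4 dies raises the worst case from 2 hops to 3, and the middle link has to carry traffic for 4 of the 6 die pairs.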
The M3 Pro is a bit 'dialed back' in terms of memory bandwidth, so if the desktop Max-lite came in a bit lower than the laptop Max, that probably wouldn't erase the gap between the Mini's M3 Pro and the Studio's desktop Max.
Amortizing over all Mac Studio and Mac Pro sales would spread out the costs. A relatively large 'fork' off the Max just for 'Ultra Studios' and 'Mac Pros' is likely dubiously low volume. Apple could make some exotic, even more expensive stuff than the above ... but then who is going to buy it? I'm not sure why Apple would want to scale past 3 if it's looking at return on investment. There's a pretty good chance it isn't there.