Apple didn't refresh the Mac Pro from 2013 to 2019. The 2019 one is getting lapped by Apple's own laptops. There's no way many professionals are sticking with it unless they absolutely have to.
Why do you think they are now also transitioning the Mac Pro away from Intel to Apple Silicon?
 
Right, if you need longer reach (more physical distance between die) and/or more die-to-die links, you'll eventually be forced to go with SERDES to get more bandwidth out of each off-chip wire. SERDES always add latency. There are well-known ways of mitigating this; if you want an example, look up what Intel has disclosed about its CSI/QPI/UPI interconnect for multi-socket systems. Intel has to use SERDES there because their comm links are off-package and potentially even off-board, not just off-die, so there are a few orders of magnitude fewer signals than 10K, several orders of magnitude longer distances, and connectors. They also have to support all kinds of other stuff needed to scale up to hundreds or thousands of nodes.

What kind of bandwidth and what kind of overhead are realistically achievable with current SERDES solutions?


What interests me is how big a system-on-package can Apple build with these extremely short reach edge-to-edge links they've demonstrated with M1 Ultra. You can't build an arbitrarily big system this way, it can't possibly scale out to huge systems like Intel's interconnect does, but Apple doesn't need that. They only need enough scaling to cover the existing Mac workstation market.

Yeah, I think this is the crucial point.
 
Those things are done in the cloud now, which is what is being discussed. There is almost no advantage to running a scientific simulation on a local desktop over the cloud.

Right now, no one serious is buying a Mac Pro to run scientific simulations.

End of discussion.
Could you please stop insisting that you know other people's requirements and best practices better than those people themselves do?

And, even if you are running massive simulations on massive cloud servers, these have to be implemented, tested and fine-tuned first, and for that you'd want to work on a local machine without the large turn-around time of deploying every little change to a remote machine.
 
It's true that a NoC (network-on-chip) is quite different from ethernet. However, general networking concepts still apply. You can just throw in something broadly similar to an ethernet switch. Even a single die couldn't function at all without switches.
...which is why I said that a switch was one conceptual approach to solving the problem.
For M1 Ultra, Apple's die-to-die interconnect solution appears to be simple brute force. They claim that ~10000 interconnect signals provide 2.5 TB/s bandwidth. 2.5e12 * 8bits/byte = 20e12 bits per second. 20e12 bps/10e3 wires = 2e9 bits/s per wire. That 2 GHz frequency plus the extremely short reach of the interconnect suggests that Apple didn't bother with SERDES. Instead, they're taking advantage of having so many signals by just running their on-die NoC protocol through the wires connecting the die together, and probably at the standard on-die NoC frequency.
If there's really no frequency change, it may be as simple as a flop just before each off-die transmitter pin and a flop just after each off-die receiver pin, adding 2 cycles of latency when crossing die. Even if it's slightly more complex than that, this is how they were able to claim software can treat M1 Ultra as if it's just one giant SoC. Technically it's a NUMA system, but the non-uniformity is so mild that software can safely ignore it.
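To make that arithmetic concrete, here's a quick back-of-the-envelope sketch. Only the 2.5 TB/s and ~10,000-signal figures are Apple's claims; the 2 GHz clock and the two-flop crossing are the speculation above, not anything Apple has confirmed.
Code:
# Back-of-the-envelope check of the UltraFusion per-wire numbers.
# Only the 2.5 TB/s and ~10,000-signal figures are Apple's claims;
# the clock and the two-flop latency are speculation.
aggregate_bytes_per_s = 2.5e12                      # claimed 2.5 TB/s
signals = 10_000                                    # claimed ~10K signals
per_wire_bits_per_s = aggregate_bytes_per_s * 8 / signals
print(per_wire_bits_per_s / 1e9, "Gbit/s per wire")            # -> 2.0

noc_clock_hz = 2e9      # assumed: same ~2 GHz as the per-wire rate
extra_cycles = 2        # one launch flop + one capture flop per crossing
print(extra_cycles / noc_clock_hz * 1e9, "ns added per die crossing")  # -> 1.0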
I think you're right that they're running the on-die protocol, and this is exactly what I was saying- their current approach is "ignore the issue", but I'm dead certain that's not their long-term plan. Software *can* ignore it, and it's not a disaster, but you're definitely leaving some performance on the table if your chip and OS aren't smart enough to plan around that.
What interests me is how big a system-on-package can Apple build with these extremely short reach edge-to-edge links they've demonstrated with M1 Ultra. You can't build an arbitrarily big system this way, it can't possibly scale out to huge systems like Intel's interconnect does, but Apple doesn't need that. They only need enough scaling to cover the existing Mac workstation market.
I can imagine an Apple-style "I/O die" that is just interconnect on all four edges, with four Maxes hanging off it. That would even correspond to the rumored high-end core count. It would be more NoC and not necessarily much else, though it could also contain stuff like, for example, a controller for more, slower memory (if they wanted to do tiered memory, to support large memory configs).

Thing is, that sounds like something that would scale even worse than the Ultra. And perhaps that's why they cancelled it - if indeed they did. We don't really know. Of course, if a lot of the work that's gone into the M2 has been uncore improvements, that could be exactly what they've been working on- fixing scaling for that case.

We'll know more soon - perhaps even tomorrow, though if that's the case, the new chips will *not* be N3b. Not enough time to go through production. They might be N4 though - as a die shrink it's quite plausible.
 
With modern sophisticated power control, it's not difficult to do that kind of peak limiting with firmware. NVidia's Quadro (workstation/scientific/server product line) GPUs already do it, because they go into systems with more need to control this than desktop PCs, and a hypothetical MPX-ified 4090 would too.
Sure, you could obviously build an appropriate card. But you'd have to do some engineering to make sure power was always under control, and that's exactly what I was saying in my post.
 
Software *can* ignore it, and it's not a disaster, but you're definitely leaving some performance on the table if your chip and OS aren't smart enough to plan around that.

Doesn't the same consideration apply to the existing Apple designs? Their CPUs are arranged in clusters with shared L2 cache for example. The scheduler and power management agents are already taking this into account (e.g. compacting threads on core clusters and powering down unused cores).

BTW, one of the patents I have linked describes a memory folding algorithm that will move physical RAM pages closer to the active clusters when fewer computational resources are needed.
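To illustrate what "compacting threads on core clusters" looks like in practice, here's a toy sketch. This is not Apple's actual scheduler, just the general idea: fill one cluster before waking the next, so idle clusters can stay power-gated.
Code:
# Toy illustration of cluster-aware placement (not Apple's actual scheduler):
# fill one cluster before waking the next, so idle clusters stay power-gated.
CORES_PER_CLUSTER = 4

def compact_placement(num_threads, num_clusters):
    """Return (cluster, core) slots for the runnable threads, packed tightly."""
    if num_threads > num_clusters * CORES_PER_CLUSTER:
        raise ValueError("more runnable threads than cores")
    return [(t // CORES_PER_CLUSTER, t % CORES_PER_CLUSTER)
            for t in range(num_threads)]

# 6 runnable threads on a 2-cluster (8-core) part: cluster 0 is fully
# packed, cluster 1 only wakes up for the two overflow threads.
print(compact_placement(6, num_clusters=2))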

I can imagine an Apple-style "I/O die" that is just interconnect on all four edges, with four Maxes hanging off it. That would even correspond to the rumored high-end core count. It would be more NoC and not necessarily much else, though it could also contain stuff like, for example, a controller for more, slower memory (if they wanted to do tiered memory, to support large memory configs).

Wouldn't this kind of solution be limited by the bandwidth of the active chip edge (i.e. the transfer between the I/O die and the SoC)? Personally, I am more curious about the possibility of using two active edges, e.g. something like this:


Code:
 RAM RAM   RAM RAM
 RAM SoC | SoC RAM
     ———   ———
 RAM SoC | SoC RAM
 RAM RAM   RAM RAM

No need for the I/O die (the SoCs are their own routing network), more opportunities for network balancing (data can be routed via either of the two respective neighbours depending on network status), lower amortised network distance and lower bandwidth requirements per active edge.

This of course will only work for up to four SoCs, but does Apple even need more? Scaling up to 64 cores can be achieved by introducing an additional P (or E) cluster on the SoC itself.
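For what it's worth, here's a tiny sketch of the routing freedom in that layout. The SoC labels A through D are mine and purely illustrative; the point is just that each SoC has two active edges, so any diagonal pair always has two equal-length routes to pick from.
Code:
# Tiny sketch of the 2x2 layout above (SoC labels A-D are made up):
#   A | B
#   -----
#   C | D
# Each SoC has two active edges, so diagonal traffic can pick either of
# two equal-length routes depending on how loaded each link is.
links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}

def two_hop_routes(src, dst):
    """All two-hop paths from src to dst through a shared neighbour."""
    return [(src, mid, dst) for mid in links[src] if dst in links[mid]]

print(two_hop_routes("A", "D"))  # [('A', 'B', 'D'), ('A', 'C', 'D')]
print(two_hop_routes("A", "B"))  # [] -- adjacent SoCs just use the direct link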
 
I wonder when Apple will make the jump from 4-core clusters to 8-core clusters like AMD did when going from Zen 2 to Zen 3.

What would be the advantage for Apple? The 4-core cluster offers good power management granularity and making a shared L2 for four cores is probably easier than for eight. Maybe one day in the far future, when transistors get even smaller. AMD is in a different situation since they want to maximise the number of cores per chip.
 
What would be the advantage for Apple? The 4-core cluster offers good power management granularity and making a shared L2 for four cores is probably easier than for eight.
Can AMD change frequency and power of individual cores within a cluster, which they call CCX, while Apple can't?

Anyhow, the advantage would be faster inter-core communication and faster thread migration across a larger pool of cores.
AMD is in a different situation since they want to maximise the number of cores per chip.
They didn't change the number of cores per chip: from Zen 1 to Zen 4 they have 8 cores per CCD; from Zen 2 to Zen 3 they changed from two 4-core CCXs per CCD to one 8-core CCX per CCD.
 
Doesn't the same consideration apply to the existing Apple designs? Their CPUs are arranged in clusters with shared L2 cache for example. The scheduler and power management agents are already taking this into account (e.g. compacting threads on core clusters and powering down unused cores).
Of course. That's what I was saying earlier - Apple consistently goes for the smart optimizations, in an iterative fashion. Start basic, add smarts over time. If they're ignoring the NUMAness of the Ultra now (I think they are), that doesn't mean they will continue to do so in future iterations.
Wouldn't this kind of solution be limited by the bandwidth of the active chip edge (i.e. the transfer between the I/O die and the SoC)?
Absolutely, you'd be limited by bandwidth (and latency). That's the reason scaling is such a focus.
Personally, I am more curious about the possibility of using two active edges, e.g. something like this:
Code:
 RAM RAM   RAM RAM
 RAM SoC | SoC RAM
     ———   ———
 RAM SoC | SoC RAM
 RAM RAM   RAM RAM
(Sorry. The Forum software chops off quotes after a few lines. Not much I can do about that.)
No need for the I/O die (the SoCs are their own routing network), more opportunities for network balancing (data can be routed via either of the two respective neighbours depending on network status), lower amortised network distance and lower bandwidth requirements per active edge.

This of course will only work for up to four SoCs, but does Apple even need more? Scaling up to 64 cores can be achieved by introducing an additional P (or E) cluster on the SoC itself.
You're again running into physics. The problem is that you just don't have that much shoreline available. Look at a Max die shot (though I'm going from memory, so sorry if I get some details wrong). RAM lines basically cover two edges. (And that's before adding your hypothetical third CPU cluster - not to mention a matching GPU core or two as well, both of which would want more memory channels for optimum performance.) The top edge is the interconnect to the other Max in an Ultra. That leaves you only one edge for all your I/O - PCIe, video, everything else. If you take that edge for another interconnect... you've just lost all your I/O. Your chip is a deaf/mute genius. :)

Thus, my "I/O die" proposal (which isn't really an I/O die like in the Zen family). You get just one interconnect per CPU die, leaving enough shoreline for everything else you need, while the IO die has the smarts to switch traffic as needed among all the CPU dies.

BTW, note also that your proposed layout has, necessarily, multiple NUMA zones. For any given core, 50% of the other cores are a hop away and 25% are two hops away. This does in fact matter for performance (ask owners of the first generation EPYCs!), though probably somewhat less than with other architectures given the ridiculous speed of the interconnect.
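For the curious, a quick sketch of where those percentages come from. The four-SoC count and hop distances follow from the 2x2 layout above; the per-die core count is just a placeholder, since the ratios barely depend on it.
Code:
# Where "50% one hop / 25% two hops" comes from in the 2x2 layout.
# CORES_PER_DIE is a placeholder; the ratios barely depend on it.
from collections import Counter

CORES_PER_DIE = 16
hops_from_A = {"A": 0, "B": 1, "C": 1, "D": 2}   # hop count seen from die A

counts = Counter()
for die, distance in hops_from_A.items():
    # other cores reachable at this distance (exclude the core itself on A)
    counts[distance] += CORES_PER_DIE - (1 if die == "A" else 0)

total = sum(counts.values())
for distance in sorted(counts):
    print(f"{distance} hop(s): {100 * counts[distance] / total:.0f}%")
# -> 0 hop(s): 24%, 1 hop(s): 51%, 2 hop(s): 25%  (tends to 25/50/25)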
 
That leaves you only one edge for all your I/O - PCIe, video, everything else. If you take that edge for another interconnect... you've just lost all your I/O. Your chip is a deaf/mute genius. :)

Bah, who needs I/O, right? But yes, I completely forgot about that 😁
 
Central I/O die with all four edges as interconnects for four SoCs, so a cross-formation...

This leaves three sides of each SoC open for connections, but...

Only one side of each SoC has a fully exposed edge, while the other two sides share a quadrant with a neighboring SoC...
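If I'm reading that right, something like this (my crude sketch of the cross formation being described, in the same spirit as the earlier diagram):
Code:
          SoC
           |
 SoC --- [I/O] --- SoC
           |
          SoC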
 
Could you please stop insisting that you know other people's requirements and best practices better than those people themselves do?

And, even if you are running massive simulations on massive cloud servers, these have to be implemented, tested and fine-tuned first, and for that you'd want to work on a local machine without the large turn-around time of deploying every little change to a remote machine.
I will stop if you stop insisting that there are legitimate Mac Pro use cases that require 1TB of RAM but can't be replicated using the cloud.

Speaking as a software engineer (not a scientist), I can't think of how you can test simulations locally before deploying. You're not going to be able to test anything locally that requires a cluster or a supercomputer. The way these things usually work is that the code is hosted on a place like GitHub and the cloud cluster/supercomputer simply pulls from the GitHub repo. The only thing you do locally is write the code, which does NOT require 1TB of RAM to do. A MacBook Air M1 with 8GB of RAM is more than enough.

It's just weird. Scientific simulations are usually done on supercomputers running Linux or at least extremely beefy cloud clusters. I've never heard of someone using a Mac Pro desktop to do any serious scientific simulations in 2023.
 
I will stop if you stop insisting that there are legitimate Mac Pro use cases that require 1TB of RAM but can't be replicated using the cloud.
Of course you can do almost everything in the cloud, but sometimes it's cheaper or more productive to do things locally.

I'm a bioinformatics methods developer, which means I do various things between theoretical computer science, software development, and genomics. I mostly use a 128 GB consumer desktop. When that's not enough, I start a single 512 GB cloud instance, pretending that it's a local single-user computer. I would like to have a 512 GB workstation, but that's not cost-effective in my use case.

The last time I checked, a 128 GB consumer desktop was $5k, a 512 GB workstation was $12k, and a 1 TB workstation was $15k. Memory itself is pretty cheap, but you have to pay a steep price premium for other hardware if you want to go beyond 128 GB. On the other hand, 1 TB cloud instances tend to be twice as expensive as 512 GB instances, because a 512 GB instance is often half of a 1 TB instance. The higher your memory requirements are, the more cost-effective buying a workstation becomes over using the cloud.
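As a rough illustration of that trade-off: the workstation prices below are the ones quoted above, but the hourly cloud rates are made-up placeholders (I'm only assuming the "1 TB costs twice as much as 512 GB" shape), so treat the exact break-even numbers as a sketch.
Code:
# Rough workstation-vs-cloud break-even sketch. Workstation prices are the
# figures quoted above; hourly cloud rates are assumed placeholders that
# only preserve the "1 TB costs ~2x the 512 GB instance" shape.
workstation_usd = {512: 12_000, 1024: 15_000}
cloud_usd_per_hr = {512: 4.00, 1024: 8.00}          # assumption, not a quote

for gb in (512, 1024):
    hours = workstation_usd[gb] / cloud_usd_per_hr[gb]
    print(f"{gb} GB: workstation pays off after ~{hours:.0f} instance-hours "
          f"(~{hours / (24 * 30):.1f} months of 24/7 use)")
# The 1 TB break-even arrives sooner: cloud pricing doubles while the
# workstation premium grows much less, which is the point above.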

I often travel for extended periods of time, and my productivity suffers from having to use a laptop. The cycle of writing some code, testing it with a large enough dataset, looking at what happened, and adjusting something becomes slower and less convenient when I have to run the code in the cloud. Having a dedicated cloud instance running 24/7 would improve productivity, but it would also be really expensive. (And this is why the speculation about 96 GB of memory in the M2 Max MBP looks so promising.)
 
Those things are done in the cloud now, which is what is being discussed. There is almost no advantage to running a scientific simulation on a local desktop over the cloud.

Right now, no one serious is buying a Mac Pro to run scientific simulations.

End of discussion.
After Effects and working with massive 3D models do not work so well over the cloud.
 
I was going to ask here if Apple Silicon uses micro-ops, but I found this informative article for anyone else that was curious.
 
I was going to ask here if Apple Silicon uses micro-ops, but I found this informative article for anyone else that was curious.

Depends on what you mean by "micro-ops". Apple Silicon definitely can and will execute some instructions as multiple operations. Certain instruction sequences are instead executed as a single operation, and a certain subset of instructions is even "free" (as in, does not need any operation at all — I know, crazy, right?).

You can find details here: https://dougallj.github.io/applecpu/firestorm.html
 
Depends on what you mean by "micro-ops". Apple Silicon definitely can and will execute some instructions as multiple operations. Certain instruction sequences are instead executed as a single operation, and a certain subset of instructions is even "free" (as in, does not need any operation at all — I know, crazy, right?).

You can find details here: https://dougallj.github.io/applecpu/firestorm.html
Thanks! Always appreciate info.
 
I will stop if you stop insisting that there are legitimate Mac Pro use cases that require 1TB of RAM but can't be replicated using the cloud.

I will agree with everything you said regarding very high end, very large data sets when it comes to scientific simulations and what not; yes all that stuff is done on supercomputers mostly by the various National Laboratories.

However, content creation is a whole other animal, and it simply isn't workable over the cloud except maybe when it comes to rendering out 3D projects. Big-ass server farms are great for that; but in reality the big-boy studios have their own local server farms. Particle simulations can take boatloads of RAM also.

One simply can't work on insanely large pixel dimensions without a **** load of RAM. And when I say large pixel dimensions I am talking much larger than 8k. Depending on the design and LED/screen locations/whether you count dead space the resolutions can just get stupid big.
 
We'll know more soon - perhaps even tomorrow, though if that's the case, the new chips will *not* be N3b. Not enough time to go through production. They might be N4 though - as a die shrink it's quite plausible.

From the side-by-side picture Apple displayed in the M2 Pro/Max product introduction video, the M2 Max is substantively bigger than the M1 Max. Similar bloat to the M1 -> M2 jump, only ~3x bigger. Apple has gone in the complete opposite direction of a die shrink here.

The M1 Max was a "too chunky" chiplet. This is even worse (in a scale-past-two-dies context). It makes sense for the laptops (and the Mini and the iPad-on-a-stick iMac); those are monolithic-only solutions. But if that is the level of 'chunky' they were trying to use for the 2-die and 4-die solutions, then it is not surprising the 4-die one was a 'miss' and got canceled. However, it also really doesn't make much sense to over-couple the upper 'half' of the desktop lineup to a laptop-optimized die either.

The M2 Max die shot didn't have an UltraFusion connector in it at all, and Apple has photoshopped the UltraFusion connector off the M1 Max in the side-by-side photo as well. They could be introducing the notion of a laptop 'Max' that doesn't have an UltraFusion connector. (That makes a lot of sense too, since laptop 'Max' Mac sales are much greater than desktop 'Max' sales; otherwise you ship potentially millions of connectors that don't connect to anything, even more so when the die is in 'bloat' mode anyway. Take that UltraFusion connector space and allocate it all to on-die logic.) Or it is another "delight and surprise" move with photoshopped images.

The M2 Pro and M1 Pro have the same memory bandwidth and max capacity. [Perhaps supply chain problems.]

The M2 Max has the same memory bandwidth, but does get a 4 x 24 GB (96 GB) max, though only if you buy the largest GPU core count ($200 GPU core bump + $800 jump to 96 GB). [Again, perhaps supply chain problems. The more expensive it is, usually the fewer people buy it.]
 
From the side-by-side picture Apple displayed in the M2 Pro/Max product introduction video, the M2 Max is substantively bigger than the M1 Max. Similar bloat to the M1 -> M2 jump, only ~3x bigger. Apple has gone in the complete opposite direction of a die shrink here.

The M1 Max was a "too chunky" chiplet. This is even worse (in a scale-past-two-dies context). It makes sense for the laptops (and the Mini and the iPad-on-a-stick iMac); those are monolithic-only solutions. But if that is the level of 'chunky' they were trying to use for the 2-die and 4-die solutions, then it is not surprising the 4-die one was a 'miss' and got canceled. However, it also really doesn't make much sense to over-couple the upper 'half' of the desktop lineup to a laptop-optimized die either.

The M2 Max die shot didn't have an UltraFusion connector in it at all, and Apple has photoshopped the UltraFusion connector off the M1 Max in the side-by-side photo as well. They could be introducing the notion of a laptop 'Max' that doesn't have an UltraFusion connector. (That makes a lot of sense too, since laptop 'Max' Mac sales are much greater than desktop 'Max' sales; otherwise you ship potentially millions of connectors that don't connect to anything, even more so when the die is in 'bloat' mode anyway. Take that UltraFusion connector space and allocate it all to on-die logic.) Or it is another "delight and surprise" move with photoshopped images.

The M2 Pro and M1 Pro have the same memory bandwidth and max capacity. [Perhaps supply chain problems.]

The M2 Max has the same memory bandwidth, but does get a 4 x 24 GB (96 GB) max, though only if you buy the largest GPU core count ($200 GPU core bump + $800 jump to 96 GB). [Again, perhaps supply chain problems. The more expensive it is, usually the fewer people buy it.]

432mm² to 475mm² is not chunky.

Broadcom has combined chips that are much larger with TSMC's CoWoS interposer for a 1700mm² monstrosity at twice the reticle limit with current equipment. Perhaps Apple was trying to combine bigger chips for the Mac Pro?

Regarding memory bandwidth I don't think the M2 Max SoC is going to be limited by it anyway, so no reason to change it when they have the supply chain locked down for the current LPDDR5-6400.

LPDDR5X-8533 is running 33% faster so that would give 546GB/s at 512-bit or 273GB/s at 256-bit memory channel width. It is also a lot more expensive right now.
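For reference, the arithmetic behind those numbers (a minimal sketch; the data rates and bus widths are the ones mentioned above, with 256-bit being the Pro-class bus and 512-bit the Max-class bus):
Code:
# Peak DRAM bandwidth = data rate (MT/s) * bus width (bits) / 8 bits per byte.
def peak_gb_per_s(mt_per_s, bus_bits):
    return mt_per_s * 1e6 * bus_bits / 8 / 1e9

for name, rate in (("LPDDR5-6400", 6400), ("LPDDR5X-8533", 8533)):
    for bus_bits in (256, 512):
        print(f"{name} @ {bus_bits}-bit: {peak_gb_per_s(rate, bus_bits):.0f} GB/s")
# LPDDR5-6400:  205 GB/s (256-bit),  410 GB/s (512-bit)
# LPDDR5X-8533: 273 GB/s (256-bit),  546 GB/s (512-bit)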
 
Of course you can do almost everything in the cloud, but sometimes it's cheaper or more productive to do things locally.

I'm a bioinformatics methods developer, which means I do various things between theoretical computer science, software development, and genomics. I mostly use a 128 GB consumer desktop. When that's not enough, I start a single 512 GB cloud instance, pretending that it's a local single-user computer. I would like to have a 512 GB workstation, but that's not cost-effective in my use case.
This is what I've been arguing for the whole time.

It makes little economic sense to buy a 1TB RAM Mac Pro for local work that can be done faster and cheaper via the cloud.

I'm going to mark you as someone who agrees with me.
 
I will stop if you stop insisting that there are legitimate Mac Pro use cases that require 1TB of RAM but can't be replicated using the cloud.

Speaking as a software engineer (not a scientist), I can't think of how you can test simulations locally before deploying. You're not going to be able to test anything locally that requires a cluster or a supercomputer. The way these things usually work is that the code is hosted on a place like GitHub and the cloud cluster/supercomputer simply pulls from the GitHub repo. The only thing you do locally is write the code, which does NOT require 1TB of RAM to do. A MacBook Air M1 with 8GB of RAM is more than enough.

It's just weird. Scientific simulations are usually done on supercomputers running Linux or at least extremely beefy cloud clusters. I've never heard of someone using a Mac Pro desktop to do any serious scientific simulations in 2023.
In most cases, a program using 1TB of RAM reads huge input data. If you need to upload that data to the cloud, it may take a lot of time. Such a workflow is very inefficient. This is one of the reasons we can't use the cloud efficiently in my company and we continue using a lot of local compute resources. Cloud computing is also very expensive for many use cases.
 
In most cases, a program using 1TB of RAM reads huge input data. If you need to upload that data to the cloud, it may take a lot of time. Such a workflow is very inefficient. This is one of the reasons we can't use the cloud efficiently in my company and we continue using a lot of local compute resources. Cloud computing is also very expensive for many use cases.
It's extremely rare in the big data world to have data stored locally rather than in the cloud. What workload are you referring to? Examples?

If you have any cloud setup, you usually work with data that is already stored in the cloud, e.g. via AWS S3 or some kind of database.

Science simulation is one area that I believe is extremely cost prohibitive to do serious work on locally. The forum members here are talking about spending $40k to get a 1TB Mac Pro... You can rent an AWS server with 24TB of RAM and 448 CPU cores for a fraction of the cost by the hour.

It makes zero sense to buy a $40k Mac Pro to do science simulation on. Period. No one here has come out and claimed they use 1TB of RAM on a Mac Pro to do real science simulations. I don't expect anyone to.
 
432mm² to 475mm² is not chunky.

Broadcom has combined chips that are much larger with TSMC's CoWoS interposer for a 1700mm² monstrosity at twice the reticle limit with current equipment. Perhaps Apple was trying to combine bigger chips for the Mac Pro?

Apple didn't use CoWoS on the Ultra. InFO-LSI is/was at the reticle limit. So having two 400+ mm² dies is relatively chunky unless they switch tech. Yes, Apple would likely need something much better for the "4 Max" setup (if it even worked, given the awkward memory layout), but the packaging rollout is different.

500-600 mm² is pretty much going to limit the scaling to just two dies, even with 1700 mm² of space. If it can't scale, then the chiplets are chunky.

What disaggregation is Broadcom replicating on that package? Or is it just pulling the HBM RAM onto the package (RAM stacks are not 'chiplets')? Two SoCs running in fault-tolerant parallel mode on a package, each with their own RAM store, isn't particularly 'chiplets' either.

There was an Intel mega package that put two server dies in one package.



[Image: Intel Xeon Platinum 9200 processor overview]


The Xeon SP Gen 4 (Sapphire Rapids) uses largish, chunky dies:

"... With a multi-die design, you can ultimately end up with more silicon than a monolithic design can provide – the reticle (manufacturing) limit for a single silicon die is ~700-800 mm2, and with a multi-die processor several smaller silicon dies can be put together, easily pushing over 1000mm2. Intel has stated that each of its silicon tiles are ~400 mm2, creating a total around ~1600mm2. But the major downside to multi-die designs is connectivity and power. ...
"

There are accelerators on each tile/chiplet. If you are using the accelerators all the time, that's fine. If you're not using the accelerators at all, then each tile brings 'dead weight' die area. That's not a particularly good chiplet design from that perspective. Part of 'chunky' is how much dead weight you are bringing along.

Eight Thunderbolt controllers is too many in the two-die Max context (two secure elements and redundant SSD controllers are likely too many also). If you try to scale that up to four dies, it is a substantive waste of space.


Multi-chip packaging existed before chiplets; it is not exactly the same thing. Tagging everything that uses 3D (or LSI 2.5D) packaging as 'chiplets' is a bit of an overreach.


Regarding memory bandwidth I don't think the M2 Max SoC is going to be limited by it anyway, so no reason to change it when they have the supply chain locked down for the current LPDDR5-6400.

It is limited somewhat; Apple boosted the size of the system cache (mentioned in the overview), which is contributing to the size increase both now at TSMC N5 and later on, if they stick with the system cache on the main computational die. It's not going to shrink any time soon, even with new next-gen fab processes.

But yes, die component costs play a role in the design choices Apple is making with the M-series SoCs. The latest and greatest won't always make the cut.
 