Thanks. There is no Extreme CPU from Apple.
Let's do some conjecturing.
If the 40-GPU-core M4 Max is close to an RTX 3090, and assuming the M5 Max keeps the same 40-core count with a 30% improvement, we're looking at a 4080, maybe a 4090-class GPU.
The as-yet-unannounced M5 Ultra will conceivably have 2x the GPU cores, so we should see nearly double the performance, which does put us in the 5090 range.
This is just fast-and-loose, napkin-math conjecture; don't hold me to it, given that we don't even have an M5 Pro yet, never mind a Max or Ultra.
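To make that napkin math explicit, here's a minimal Python sketch. Every number in it is one of the assumptions stated above (M4 Max ≈ RTX 3090, ~30% generational uplift, Ultra = 2x Max with perfect scaling), not a measurement:

```python
# Napkin-math sketch of the conjecture above. All figures are assumptions,
# not measurements.

m4_max_vs_3090 = 1.00   # assumption: 40-core M4 Max roughly matches an RTX 3090
m5_uplift = 1.30        # assumption: ~30% gen-on-gen GPU improvement
ultra_scaling = 2.0     # assumption: Ultra = 2x Max cores, perfect scaling

m5_max = m4_max_vs_3090 * m5_uplift    # ~1.3x an RTX 3090
m5_ultra = m5_max * ultra_scaling      # ~2.6x an RTX 3090

print(f"Hypothetical M5 Max:   {m5_max:.2f}x RTX 3090")
print(f"Hypothetical M5 Ultra: {m5_ultra:.2f}x RTX 3090")
```

Whether ~2.6x a 3090 actually lands in 5090 territory is exactly the open question here.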
Thanks. You said the M5 Ultra could match the RTX 5090, but would that also be the case in Mac-native games?
Yes, natively running apps or games, not using CrossOver. If you take the benchmarks for the existing Apple silicon chips, you can extrapolate a possible scenario where the M5 Ultra is on par with a 5090. Again, it's conjecture, since we don't have an M4 Ultra, never mind an M5 Ultra.
According to ChatGPT, an RTX 5080 (not an RTX 5090) gets around double the fps at 4K native ultra settings in Resident Evil 4 or Village (can't remember which) when compared to an M3 Ultra. Does this bode well for the M5 Ultra matching the RTX 5090? ChatGPT could be wrong about the fps numbers, though.
Doubtful.
One is a dedicated GPU that costs thousands of dollars. The other is primarily a CPU, with integrated GPU cores for far less money.
Edit: I was just laughing after noticing that an RTX 5090 costs MORE than my MBP.
Unfortunately, you can't really compare raw FP32 TFLOPS across different architectures and come up with a relative theoretical performance that is meaningful. I don't even mean from the perspective that drivers, software optimization, etc. matter. Rather, even at a fundamental hardware level, the ways different architectures access their FP32 units differ enough that this alone skews the results past the point of utility.
According to this article (https://tzakharko.github.io/apple-neural-accelerators-benchmark/), a hypothetical M5 Max GPU's theoretical performance is 18 FP32 TFLOPS. This assumes a GPU clock speed for the M5 GPU cores (which Apple does not publish) of 1750 MHz.
As "Extreme" is typically used, it means 2x Ultra = 4x Max.
This would give 4 x 18 FP32 TFLOPS = 72 FP32 TFLOPS for a hypothetical M5 Extreme. By comparison, a desktop 5090 is 105 FP32 TFLOPS (https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216).
So it appears the raw theoretical performance of a hypothetical M5 Extreme would be only ≈70% of a 5090 desktop.
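As a sanity check on those figures, here's a small sketch of the usual FP32 peak-throughput arithmetic. The 1750 MHz clock is the assumption quoted above, and the 128 FP32 lanes per Apple GPU core (with an FMA counted as 2 FLOPs) is a commonly cited figure, not something Apple publishes:

```python
# Peak FP32 throughput: cores * lanes per core * 2 FLOPs per FMA * clock (GHz),
# expressed in TFLOPS. Lane count and clock are assumptions, not Apple specs.

def fp32_tflops(cores: int, lanes_per_core: int = 128, clock_ghz: float = 1.75) -> float:
    return cores * lanes_per_core * 2 * clock_ghz / 1000

m5_max_hypothetical = fp32_tflops(40)       # ~17.9 TFLOPS, i.e. the ~18 above
m5_extreme_hypothetical = fp32_tflops(160)  # 4x Max -> ~71.7 TFLOPS
rtx_5090 = 105                              # TechPowerUp figure linked above

print(f"Hypothetical M5 Max:     {m5_max_hypothetical:.1f} TFLOPS")
print(f"Hypothetical M5 Extreme: {m5_extreme_hypothetical:.1f} TFLOPS")
print(f"Fraction of an RTX 5090: {m5_extreme_hypothetical / rtx_5090:.0%}")
```

The last line prints roughly 68%, which is where the ≈70% figure comes from, with all the caveats about raw TFLOPS comparisons already noted.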
But even if so, such a difference wouldn't necessarily be reflected in their relative real-world performance scores. That depends on the particular optimizations used by Apple vs. NVIDIA, as well as the optimizations used by the software developers whose applications you are using for the comparison.
For instance, over the past few years it's been found that AAA games that are written natively for both AS and NVIDIA tend to run better on NVIDIA when evaluated using comparable GPUs. That's likely because developers have a much longer history optimizing AAA games for NVIDIA.
I would say it depends. For a lot of gaming titles, even native ones, I suspect an M5 Ultra might struggle to get close to a 5090's performance. But to give you a concrete example of one where it is very likely to get close: Blender (which is ray tracing + raster). Here are some illustrative scores to underpin a back-of-the-envelope calculation:
| Device | Blender GPU score |
| Apple M3 (GPU - 10 cores) | 919.06 |
| Apple M5 (GPU - 10 cores) | 1734.16 |
| Apple M3 Ultra (GPU - 80 cores) | 7493.24 |
| NVIDIA GeForce RTX 5090 | 14933.08 |
Based on this, the projected M5 Ultra score would be ~14138, roughly within 5% of a 5090. Now, we don't know if the M5 Ultra will hit that (or exceed it, depending on Apple's design for the M5 Max/Ultra), but the Blender GPU benchmark ranks amongst Apple's best-performing GPU benchmarks. So again, in a lot of games, even native ones, an M5 Ultra may not hit this level of relative performance. But it is still likely to be closer than Apple has ever gotten before to a (near) contemporaneous top Nvidia GPU (it's also unknown when the M5 Ultra will launch; it could be a while, hopefully less time than the M3 Ultra took).
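The projection above is just the M3 Ultra score scaled by the measured M5-vs-M3 per-core improvement at the 10-core tier:

```python
# Back-of-the-envelope Blender projection: scale the M3 Ultra score by the
# measured per-core M3 -> M5 improvement. Assumes the M5 Ultra scales across
# 80 cores the same way the M3 Ultra did, which is not a given.

m3_base = 919.06       # Apple M3, 10-core GPU (scores as quoted above)
m5_base = 1734.16      # Apple M5, 10-core GPU
m3_ultra = 7493.24     # Apple M3 Ultra, 80-core GPU
rtx_5090 = 14933.08    # NVIDIA GeForce RTX 5090

per_core_gain = m5_base / m3_base              # ~1.89x
m5_ultra_projected = m3_ultra * per_core_gain  # ~14139, i.e. the ~14138 above

print(f"Projected M5 Ultra: {m5_ultra_projected:.0f}")
print(f"Relative to a 5090: {m5_ultra_projected / rtx_5090:.1%}")
```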
We have to see what performance gains we get with the new TSMC packaging tech that the M5 Pro and Max are set to use.
And again, that's not an Extreme; that's an M5 Ultra coming close to matching the 5090 (in theory). Of course, this relies on the M5 Ultra scaling the same, and, as both you and I mentioned, this is not expected to be the case in gaming for lots of reasons.
Oh, I agree it's also possible that the M5 Ultra will scale even better depending on what Apple does, and in my original post I noted the M5 Ultra could indeed do so. But yes, I highlighted the inverse more strongly: since we don't have M5 Ultras in front of us, I wanted to make sure it was understood that this is a prognostication based on several assumptions which could be violated by unknowns, and that the M5 Ultra may not reach this level (and yes, could exceed it).
The M5 likely will scale better than the M3 did. The 80-core version of the M3 Ultra didn't scale as a system; it likely runs out of memory bandwidth.
| Chip | Blender score | Cores vs. base M3 |
| Apple M3 Ultra (GPU - 80 cores) | 7494.03 | 8x |
| Apple M3 Ultra (GPU - 60 cores) | 6505.79 | 6x |
| Apple M3 Max (GPU - 40 cores) | 4238.72 | 4x |
| Apple M3 Max (GPU - 30 cores) | 3449.17 | 3x |
| Apple M3 Pro (GPU - 18 cores) | 1767.1 | 1.8x |
| Apple M3 (GPU - 10 cores) | 919.06 | 1x |
1.8 * 919 = 1654 ( versus 1767 ) ( the Pro's LPDDR5 is also clocked higher )
3 * 919 = 2757 ( versus 3449 )
4 * 919 = 3676 ( versus 4238 )
6 * 919 = 5514 ( versus 6505 ) [ 2 * 2757 = 5514 , 2 * 3449 = 6898 ]
8 * 919 = 7352 ( versus 7494; undershoots 2x the 40-core Max ) [ 2 * 4238 = 8476 ]
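Here's the same check as a quick script: a linear projection from the base M3 score versus the actual Blender results listed above.

```python
# Linear-scaling check: project each M3-family part from the 10-core M3 score
# and compare against its actual Blender score (numbers as listed above).

base_score, base_cores = 919.06, 10

actual = {  # name: (GPU cores, Blender score)
    "M3 Pro 18c":   (18, 1767.10),
    "M3 Max 30c":   (30, 3449.17),
    "M3 Max 40c":   (40, 4238.72),
    "M3 Ultra 60c": (60, 6505.79),
    "M3 Ultra 80c": (80, 7494.03),
}

for name, (cores, score) in actual.items():
    projected = base_score * cores / base_cores
    print(f"{name}: projected {projected:.0f}, actual {score:.0f}, ratio {score / projected:.2f}")
```

The 80-core Ultra has the weakest actual-to-projected ratio of the bunch, which is consistent with it being the most bandwidth-constrained configuration.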
Indications are that the 80-core variant is bandwidth starved. The raw compute is there, but it just can't get data to the units fast enough. (The plain M3 uses slower LPDDR5 memory, so it also isn't getting fed as well per core.) Some of this is also UltraFusion getting 'old' relative to the memory increases.
The M5 comes with faster LPDDR5 as well. On the M3 iteration, Apple wasn't that aggressive in following the bleeding edge of LPDDR (they could have traded off the 'unknown', more expensive fab process against more affordable LPDDR). The base M4 and M5 both came with uplifts in LPDDR clocks vs. the M3 (e.g., Samsung has some custom LPDDR5X-10700). If Apple also cranks up the 2.5D connection between dies ("UltraFusion 2"), then there should be a substantive bandwidth jump from the M3 Ultra era. A possible issue is if they didn't order/reserve enough memory far enough in advance, before the AI hype train bought it out. Skipping the M4 Ultra means there's a good chance they were substantially planning ahead for the M5 Ultra's needs.
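For a sense of scale, the bandwidth arithmetic is just transfer rate times bus width. The bus widths here (512-bit for a Max-class die, double that for an Ultra) and the LPDDR5X-10700 pairing are assumptions for illustration, not confirmed specs:

```python
# Peak memory bandwidth in GB/s: (MT/s * bus width in bits) / 8 bits per byte.
# Bus widths and memory pairings below are illustrative assumptions.

def bandwidth_gbs(transfer_mts: int, bus_width_bits: int) -> float:
    return transfer_mts * bus_width_bits / 8 / 1000

print(f"LPDDR5X-8533,  512-bit bus:  {bandwidth_gbs(8533, 512):.0f} GB/s")    # ~546
print(f"LPDDR5X-10700, 512-bit bus:  {bandwidth_gbs(10700, 512):.0f} GB/s")   # ~685
print(f"LPDDR5X-10700, 1024-bit bus: {bandwidth_gbs(10700, 1024):.0f} GB/s")  # ~1370
```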
TSMC's 'new' packaging tech is generally just the same kind of packaging tech used for UltraFusion. The pads/connections are smaller, which conceptually means you can get more of them between dies. However, the primary performance impact you're going to get is that the two (or more) dies act like the performance of a single monolithic die. It isn't going to bring performance in and of itself. The construction of something bigger than one max-sized die is where the performance comes from, not the connection.
Folks keep waving at this new tech as magical sprinkle power. The horizontal packaging variant simply allows Apple to do 'better' Ultras than they have been doing. If Apple switches to smaller building blocks (chiplets), then the connectivity is just getting back to the performance the monolithic 'Max'-sized chips could already do. It might end up more economical (if the defect savings outweigh the increased packaging costs), but that's more about costs than performance. The more expensive fab process might be more easily paid for that way. But again, the fab process is the core source of the performance enablement, not the connection.
Also, the pattern you noted with the M3 Ultra is actually repeated in the M4 line as well. I looked pretty closely at the binned/full Pro/Max using the 3DMark benchmarks (and Blender too, actually): sometimes it seems related to bandwidth issues, other times it's less clear (well, one time anyway: one data point on the Steel Nomad Light benchmark makes no sense to me, but everything else, including the full Steel Nomad results, tracks okay). The clearest result was that the binned M4 Pro has the best bandwidth/TFLOPS ratio of the Pro/Max M4 lineup and the best score per TFLOPS amongst those devices. Again, that's why I was being careful to make sure people understood this is just a rough estimation.
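The bookkeeping behind that comparison is simple enough to show as a sketch. The numbers below are made-up placeholders just to illustrate the method (score per TFLOPS and bandwidth per TFLOPS); they are not the actual M4 results discussed here:

```python
# Illustrative ratio bookkeeping: divide benchmark score and memory bandwidth
# by FP32 TFLOPS for each chip. All values are hypothetical placeholders.

chips = {
    # name: (benchmark score, FP32 TFLOPS, bandwidth GB/s) -- placeholders
    "binned Pro": (1000.0, 5.0, 273.0),
    "full Pro":   (1150.0, 6.0, 273.0),
    "binned Max": (1800.0, 10.0, 410.0),
    "full Max":   (2300.0, 13.0, 546.0),
}

for name, (score, tflops, bandwidth) in chips.items():
    print(f"{name:10s} score/TFLOPS {score / tflops:6.1f}   GB/s per TFLOPS {bandwidth / tflops:5.1f}")
```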
Good points. Though what about 3D wafer-on-wafer configurations? When will we see them, and could those enhance performance by allowing shorter communication paths?
E.g., suppose a processing unit on a monolithic chip most often talks to other units that, because of other constraints, can't be located near it on the monolithic chip. With a wafer-on-wafer approach, if those other units could be positioned vertically above it, thus shortening the path length, could that enhance performance?
Of course, a big tradeoff with 3D designs is their reduced surface area and thus reduced heat dissipation. Then again, with shorter average path lengths, would thermal energy generation be reduced? (I'm not sure about the relative contribution of electron flow between processing units vs. logic gate operation to thermals.)
Separately, with 2.5D, the additional modularity might enable Apple to more easily offer Mac Studio/Pro variants with high GPU core counts.
The "Extreme" meme has always pragmatically meant more expensive. That could end up being more expensive subcomponent parts rather than more dies (Max-sized or otherwise). At some point, Apple would need to shift from "poor man's HBM" to "not quite as poor man's" HBM: more SLC (e.g., attached 3D cache) and/or faster LPDDR than the 'poor man's' version, but still less than conventional HBM.
Apple seems to regularly soak up as much bandwidth as gets added. But they are also likely not to waste die area on boosting TFLOPS significantly higher than what they can actually feed, just for some spec bragging rights (even if the CPU cores were not there, grossly outstripping the memory system would probably get passed over in favor of a more affordable die size). In terms of "keeping up with the Joneses", a lot of the doubt about Apple's steady improvements versus "competitor X's" improvements seems unjustified.
It is indicative, though, that Apple's primary design criterion is not the extreme "lunatic fringe" market. The biggest 'bang for buck' is in the more affordable part of the range. They can get a reasonably good top-end model without obsessing about being 'king of the hill' on every benchmark: make most of the apps that folks use faster, and end up with substantively better offerings each generation. It isn't about stalking a specific implementation from a competitor (a 'beat the Nvidia x090 or bust' objective).
The 4090 -> 5090 raster improvements below 4K were roughly half the size of the 4K improvements. Some of the raster gen-to-gen improvement simply comes from when the VRAM subsystem evolves to the next generation. GDDR and LPDDR don't move in 100% lockstep (e.g., LPDDR6 recently passed its standards vote while GDDR7 passed last year), so the x090 is a goofy stalking horse for that reason as well.
Optimization takes time and money. It makes more sense to invest that in major platforms than in minor ones. On average, it's reasonable to expect that computer/console games will be better optimized for Nvidia/AMD hardware than for Apple hardware, and the other way around for mobile-first games.
Apple is already using wafer on wafer with UltraFusion.
Wafer-on-wafer means vertical die stacking, and UltraFusion doesn't stack the two dies vertically. (As the name indicates: it's one die on top of another, not one die beside another.)
The odd part is, the one and only time Apple's bandwidth was tested to its max (pardon the pun) on a Max was, to my knowledge, Andrei with the M1 Max, and try as he might, he could not saturate it or even get close to doing so, despite running both CPU and GPU workloads simultaneously. Now, that was the M1 Max; I haven't gone back to check its compute-to-bandwidth ratio, and I haven't seen any similar test done by anyone since (and unfortunately all the old AnandTech articles are now completely gone, unless you have a local copy or can find them on the Internet Archive). However, bandwidth is the cleanest explanation for what I see*.
*Again, except for that one niggling Steel Nomad Light data point, where for some reason the full M4 Max lost performance points per TFLOPS despite having the same (slightly better, in fact) bandwidth per unit of compute as the binned M4 Max, but didn't do so on any other benchmark. To be fair, I did not look at Blender for this, but multiple other 3DMark benchmarks showed a very clear relationship between bandwidth and compute. Basically, the binned Pro had the best ratio of bandwidth to compute and the best ratio of performance per compute, while every other device had roughly the same bandwidth per compute and the same performance per compute. It's possible that SNL data point is just noise; I might've pulled these values from a single run rather than the median. When I get more time I'll have to double-check. (EDIT: doing a quick look, about half of the decrease in the full M4 Max relative to the binned does look like noise, but not all, so a 5% decrease in performance per TFLOPS, but only in SNL; all other tests show the expected relationship, i.e. the binned and full M4 Max showing the same performance per TFLOPS, since they have almost the same bandwidth per TFLOPS, with the full M4 Max actually being slightly better.)
Ah heck, here's the data (apologies for the picture format and the split header/body; I can't type it all out right now and the chart is too wide anyway):
[Attached images: benchmark table, header and body]
And Blender, where for some reason I did not do the 32-core score:
[Attached images: Blender results]
One other interesting thing: of course, a similar pattern emerges with Nvidia chips (no time to post it all). It has occurred to me that at least some of the pattern I saw with the "dual-issue" design of the 3000- and 4000-series GPUs that I mentioned earlier was exacerbated by their lack of bandwidth per unit of compute relative to the 2000 series, which was partially but not totally improved by the 5000 series. That said, Chips and Cheese also mentioned that they saw a similar pattern and attributed it at least in part to core occupancy issues in their tests. They may also have been testing compute more directly, avoiding any memory-system bottlenecks; I'm not sure, so this particular relationship could be slightly coincidental. However, even if so, modern Nvidia chips have pretty poor compute-to-bandwidth ratios relative both to Apple and to their own older Turing RTX 20xx cards, which had very nearly the same bandwidth per unit of compute as Apple and much more similar performance per unit of compute as well.
One issue is that the M5 did double FP16 compute. I haven't been tracking that, but if an engine does use FP16, it could affect bandwidth per unit of performance more than FP32 does: FP16 throughput doubled, which will be hard for the bandwidth increases to make up for, but also, properly coded, the bandwidth needed for FP16 should be halved relative to FP32.
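On that last point, the bandwidth side of it is just about bytes moved: the same element count in half precision is half the bytes. A trivial illustration:

```python
# FP16 vs FP32 traffic for the same element count: half the bytes.
import numpy as np

n = 1_000_000
fp32_buffer = np.zeros(n, dtype=np.float32)
fp16_buffer = np.zeros(n, dtype=np.float16)

print(f"FP32 buffer: {fp32_buffer.nbytes / 1e6:.1f} MB")  # 4.0 MB
print(f"FP16 buffer: {fp16_buffer.nbytes / 1e6:.1f} MB")  # 2.0 MB
```

So if FP16 throughput doubles while bandwidth grows more slowly, an FP16-heavy engine can still come out ahead on bandwidth per unit of work, provided it actually stores and moves data in FP16 rather than just computing in it.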
The M4 Pro only being 6 TFLOPS is surprising.
That is the binned version, but think of it this way: the binned Pro is only 60% bigger than the base, and the base is meant for the iPad Pro and 13" Air as much as, if arguably not more than, the mini/MBP/15" Air. Apple likes to double the GPU each tier, which constrains what the binned Pro can reasonably be. Even if Apple leaves the relationship between Pro and Max the same, Apple would have to more than double the full Pro's core count over the base to change this. That would be nice, but so far Apple seems pretty set on their current structure.