
thenewperson

macrumors 6502a
Original poster
Mar 27, 2011
At this point we've all seen the Bloomberg report about future ASi chips (refresher) which gets into a bit of detail about the core count variations of both the CPU and GPU that Apple is testing. But what do we speculate about other aspects of it? Will there be clock speed increases? Ray tracing hardware? Will the neural engine be the same across all? What kind of RAM will they use? How will the chips be packaged?
 
We are still talking about an M1/A14 variant, so I wouldn’t expect any microarchitectural changes. Memory bus will be doubled to 256 bit, that’s almost certain, with twice as many memory controllers as M1 and four RAM chips instead of two. Packaging I’d expect to stay the same, just with two more RAM chips on the other side. No changes to neural engine from M1. Maybe LPDDR5. More thunderbolt controllers. That’s about it.
 
Doubling the memory bus width would also double the memory bandwidth to 136 GB/s, assuming the memory technology used is still LPDDR4X. It'll be interesting to see if doing that alone will double the M1's thruput for all processing cores (i.e. CPUs, GPUs, NPUs, etc.)
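For anyone who wants to check the 68 → 136 GB/s figure, here is a rough sketch of the arithmetic. It assumes the M1 uses LPDDR4X-4266 (4266 MT/s per pin); that data rate and the function name are my assumptions for illustration, not something Apple has published in this form.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfers_per_sec: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_sec / 1e9

# Assumption: LPDDR4X-4266, i.e. 4266 million transfers per second per pin.
LPDDR4X_RATE = 4266e6

m1 = peak_bandwidth_gbs(128, LPDDR4X_RATE)       # current 128-bit bus
doubled = peak_bandwidth_gbs(256, LPDDR4X_RATE)  # hypothetical 256-bit bus

print(f"128-bit: {m1:.1f} GB/s, 256-bit: {doubled:.1f} GB/s")
```

These are theoretical peaks only; sustained bandwidth will be lower.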

Tests conducted by Anandtech on the M1 Macs show that a single Firestorm core is enough to saturate the 68 GB/s bandwidth. So it would seem that the M1 Macs are severely bandwidth constrained, with more potential yet to be unleashed?

With UMA, I would think that the M1's system interconnect fabric would have to implement some sort of fair share algorithm for each of the processing cores to prevent data starvation. So the 68 GB/s bandwidth provided by the 128-bit LPDDR4X memory could not be allocated fully to any single processing core's use.

According to Apple, the M1's GPU could perform 2.6 TFLOPS (presumably FP32). From my limited understanding, 68 GB/s is nowhere near enough to keep the M1 7/8 GPU cores fed to achieve 2.6 TFLOPS.

For iMacs and Mac Pros, I would think it's unlikely Apple will go with higher bandwidth memory, e.g. HBM2, as it'll be too cost prohibitive to implement for consumer products. What I think would be likely is that HBM2 or equivalent (costly) memory tech will be used solely for the GPU, and DDR5/LPDDR5 will be used for main memory, with the GPU sitting on a separate die/board with its own memory, but with custom circuitry to ensure memory coherency with main memory so as to preserve the UMA architecture. The 68000 Macs used to have proprietary bus slots (if memory serves) for such purposes, so Apple may go back to custom designs instead of using PCIe.

I am probably completely off tho.

Thoughts?
 
Doubling the memory bus width would also double the memory bandwidth to 136 GB/s, assuming the memory technology used is still LPDDR4X. It'll be interesting to see if doing that alone will double the M1's thruput for all processing cores (i.e. CPUs, GPUs, NPUs, etc.)

If you increase the number of processing cores, you have to increase the memory bandwidth. The GPU in M1 is already likely bandwidth-limited, if one wants to have 16 cores one needs to at least double the bandwidth.

Tests conducted by Anandtech on the M1 Macs shows that a single Firestorm core is enough to saturate the 68 GB/s bandwidth. So it would seem that the M1 Macs are severely bandwidth constrained with more potential yet to be unleashed?

With UMA, I would think that the M1's system interconnect fabric would have to implement some sort of fair share algorithm for each of the processing cores to prevent data starvation. So the 68 GB/s bandwidth provided by the 128-bit LPDDR4X memory could not be allocated fully to any single processing core's use.

That’s interesting, right? Bandwidth to individual cores is usually constrained, but not with Apple design. I wouldn’t say that M1 CPU is bandwidth constrained, more that it’s able to utilize all available bandwidth. As to what maximal bandwidth the internal fabric can support, we can only guess.

According to Apple, the M1's GPU could perform 2.6 TFLOPS (presumably FP32). From my limited understanding, 68 GB/s is nowhere near enough to keep the M1 7/8 GPU cores fed to achieve 2.6 TFLOPS.

I was able to get pretty much exactly 2.6 TFLOPS using long chains of fused multiply-adds. The FP16 performance is identical to FP32 (which is a big difference from A14, which has half the FP32 throughput). As to bandwidth... no GPU or CPU has enough of it. The assumption is that you do a bunch of calculations between loads and stores, or your ALUs are running empty.


For iMacs and Mac Pros, I would think it's unlikely Apple will go with higher bandwidth memory, e.g. HBM2, as it'll be too cost prohibitive to implement for consumer products. What I think would be likely is that HBM2 or equivalent (costly) memory tech. will be use solely for the GPU, and DDR5/LPDDR5 will be used for main memory, with the GPU sitting on a separate die/board with it's own memory, but with custom circuitry to ensure memory coherency with main memory so as to preserve the UMA architecture. The 68000 Macs used to have proprietary bus slots (if memory serves) for such purposes, so Apple may go back to custom designs instead of using PCIe.

I think we will see “real” unified memory. Ensuring coherency as you describe is really complicated, and I don't think design purists at Apple would be happy with it. Maybe not HBM, but multi-channel stacked DDR5 (8 to 16 channels, should provide plenty of bandwidth). And yeah, it’s costly but still cheaper than buying Xeons. And Apple is the only company that can afford it :)
 
Whatever the case, after the big gains of the M1 and the first models equipped with it, I hope an M1X is not going to be too underwhelming in comparison when they stick it in an iMac, where they don't quite push it far enough and the end results are very good, but not great. I hope Apple proves me wrong and really shows what it can do, with even the entry model being a more than capable machine.
 
Whatever the case, after the big gains of the M1 and the first models equipped with it, I hope an M1X is not going to be too underwhelming in comparison when they stick it in an iMac, when they just not quite push it far enough and the end results are very good, but not great. I hope Apple proves me wrong and really shows what it can do, with even the entry model being a more than capable machine.

Just remember there's a whole cadre of folks who'd complain about Apple giving away free cheese 'cuz they didn't also have crackers.

Whatever the next step is, someone's gonna complain about it.
 
At this point, I'd be most excited for them to fix the I/O, since almost all aspects of it are broken and not even close to par when compared to older Intel models...

Literally every aspect of it is broken in some way: USB is seemingly not correctly implemented, external drive compatibility is unpredictable and slow, TB3/TB4 are not properly implemented as far as DP support goes, high-res monitor support is awful, and multi-monitor support is non-existent. And these are just the issues that have gathered 20+ threads on MacRumors.

Not to mention whatever is up with the insane amount of ssd drive writes...


Let's say fine, the current gen of M1 devices are consumer grade, entry level Mac devices. But this stuff needs to be fixed when we are talking about high-end devices used in professional settings.


Regarding the "M1X" I actually hope (but doubt) we get a M2X of sorts instead, with a generational improvement in the CPU/GPU cores, revised uncore areas and widened RAM buses
 
That’s interesting, right? Bandwidth to individual cores is usually constrained, but not with Apple design. I wouldn’t say that M1 CPU is bandwidth constrained, more that it’s able to utilize all available bandwidth. As to what maximal bandwidth the internal fabric can support, we can only guess.
I would think with Apple's experience with high performance system (i.e. Mac Pros, xServes, etc) their internal fabric would be designed to handle really high bandwidth. Like you said, Apple's pocket is deep enough to go really wild as far as SoC design is concerned.

I was able to get pretty much exactly 2.6 TFLOPS using long chains of fused multiply-adds. The FP16 performance is identical to FP32 (which is a big difference from A14, which has half the FP32 throughput). As to bandwidth... no GPU or CPU has enough of it. The assumption is that you do a bunch of calculations between loads and stores, or your ALUs are running empty.
Wow! I suspect tho. that what you saw probably are calculations performed wholly using the SoC's cache? FP32 values are 32 bits long. Completing 2.6 TFLOPS with each data item 32 bits long means we need 10TB/s of bandwidth in steady state, notwithstanding the other processing cores' need for memory bandwidth. I'm sure my calculation is oversimplifying the scenario :) but I somehow think that simply doubling the bandwidth will double the M1's 8 GPU cores' performance in real world use.
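To put numbers on both sides of this exchange, here is a small roofline-style sketch. It assumes the worst case of one fresh 4-byte FP32 operand from DRAM per FLOP (which is where the ~10 TB/s comes from) and then shows how many FLOPs per byte a kernel needs before the ALUs, not memory, become the limit. The 2.6 TFLOPS and 68 GB/s figures are the ones quoted in this thread; the function and variable names are just for illustration.

```python
PEAK_FLOPS = 2.6e12    # Apple's quoted FP32 peak for the M1 GPU
MEM_BW = 68e9          # bytes/s, M1 memory bandwidth as quoted in the thread
BYTES_PER_FP32 = 4

# Worst case: every FLOP pulls one new 4-byte value from DRAM (no reuse).
naive_bandwidth = PEAK_FLOPS * BYTES_PER_FP32           # bytes/s

# Arithmetic intensity (FLOPs per byte loaded) needed to be compute-bound.
break_even_intensity = PEAK_FLOPS / MEM_BW

def attainable_flops(intensity: float) -> float:
    """Simple roofline: limited by either the ALUs or the memory bus."""
    return min(PEAK_FLOPS, intensity * MEM_BW)

print(f"no-reuse bandwidth need: {naive_bandwidth / 1e12:.1f} TB/s")
print(f"break-even intensity:    {break_even_intensity:.1f} FLOPs/byte")
print(f"at 0.25 FLOPs/byte:      {attainable_flops(0.25) / 1e9:.0f} GFLOPS")
```

A long chain of fused multiply-adds on values held in registers has a very high arithmetic intensity, which is why it can hit the peak without touching DRAM much; a streaming kernel at 0.25 FLOPs/byte sits far below it.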

I think we will see “real” unified memory. Ensuring coherency as you describe is really complicated, and I don't think design purists at Apple would be happy with it. Maybe not HBM, but multi-channel stacked DDR5 (8 to 16 channels, should provide plenty of bandwidth). And yeah, it’s costly but still cheaper than buying Xeons. And Apple is the only company that can afford it :)
I think you'll be right in that most likely they'll go with multi channel DDR5. I still can't reconcile how Apple will implement it for the iMacs and Mac Pros tho. Using soldered memory in notebooks may be fine, but I don't think Apple would want to manufacture a bunch of iMacs and Mac Pros with soldered memory and find themselves stuck with unmovable inventories. It'll be interesting to see how Apple is going to address this.

Also, I'm not sure if there are designs where the internal fabric is designed to connect to two types of memory controllers (i.e. HBM2 and DDR5/LPDDR4X). If possible, the fixed HBM2 memory will be used for the GPUs, while the slower DDR5/LPDDR4X could be via DIMM slots for the Mac Pros and iMacs, maybe even the 16" MBP. The drivers will have the smarts to delineate the memory regions for various processing cores' use. If anyone can do it, it'll probably be Apple.

At this point, I'd be most excited for them to fix the I/O, since almost all aspects of it are broken and not even close to par when compared to older Intel models...
I think and hope that the issue is likely driver related instead of the actual Silicon. Intel's CPU and their north/south bridges chipsets are mature with equally mature drivers, while the M1 has yet to be battle tested, so to speak. So I'm hopeful existing issues will be resolved with future Big Sur updates.
 
Also, I'm not sure if there are designs where the internal fabric are designed to connect to two types of memory controllers (i.e. HBM2 and DDR5/LPDDR4X). If possible, the fixed HBM2 memory will be used for the GPUs, while the slower DDR5/LPDDR4X could be via DIMM slots for the Mac Pros and iMacs, maybe even the 16" MBP. The drivers will have the smarts to delineate the memory regions for various processing cores' use. If anyone can do it, it'll probably be Apple.

That could be how Apple replaces the models with dGPUs (16" MBP, iMac), but I don't think you'll see that type of setup in a MBA or sub-16" Pro. There's a reason dGPUs (even the ones used in Macs) are running GDDR instead of DDR, so I think that you'd still have the SoC with its CPU and GPU cores, then a dedicated GPU with its own memory that connects via a (likely proprietary) high-speed bus to the SoC.
 
Just got back into investing. Might be time to buy AAPL again and some INTC puts. Intel is struggling to position themselves against Apple’s first entry level mobile chipset. And they don’t seem to have a plan if the market more broadly moves to ARM.
 
Just got back into investing. Might be time to buy AAPL again and some INTC puts. Intel is struggling to position themselves against Apple’s first entry level mobile chipset. And they don’t seem to have a plan if the market more broadly moves to ARM.

I think Intel is actively looking at partnering with TSMC or even Samsung to produce their CPUs, since those companies already have sub-10nm processes in production. With the CEO shakeup in Santa Clara, Intel is quickly becoming more amenable to partnering up with third parties to shore up their production capacities. Unfortunately, their marketing teams still rely on misleading tests and benchmarks to try to downplay the performance of the M1, and even their comparisons to AMD and nVidia on the GPU side of things are comparing current-gen Intel to 3-4 year old parts from the competitor.
 
I think you'll be right in that most likely they'll go with multi channel DDR5. I still can't reconcile how Apple will implement it for the iMacs and Mac Pros tho. Using soldered memory in notebooks may be fine, but I don't think Apple would want to manufacture a bunch of iMacs and Mac Pros with soldered memory and find themselves stuck with unmovable inventories. It'll be interesting to see how Apple is going to address this.

Also, I'm not sure if there are designs where the internal fabric are designed to connect to two types of memory controllers (i.e. HBM2 and DDR5/LPDDR4X). If possible, the fixed HBM2 memory will be used for the GPUs, while the slower DDR5/LPDDR4X could be via DIMM slots for the Mac Pros and iMacs, maybe even the 16" MBP. The drivers will have the smarts to delineate the memory regions for various processing cores' use. If anyone can do it, it'll probably be Apple.

Grossly simplifying, HBM2 is DDR with very wide data bus. The chips themselves are not very fast, but they have a lot of data being transferred simultaneously. A single HBM2 package is connected via 1024 bits, compared to 64 bits of DDR4...

Apple already stated that they are putting their money on wide memory interfaces. M1 is not really wide all things considered (it's 2x64bit bus or 4x32bit bus or 8x16bit bus depending on how you count), but we will undoubtedly see wider designs going forward. A Mac Pro with 16 DDR5 memory channels (16x64 bits = 1024 bits) would offer aggregate bandwidth of around 800GB/s, which is in the ballpark of AMD Radeon Pro Vega II.
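As a sanity check on the ~800 GB/s figure, here is a quick sketch of what 16 channels of 64 bits would deliver at a few DDR5 data rates, and what per-pin rate 800 GB/s would actually require. The specific speed grades (DDR5-4800/5600/6400) and the 2000 MT/s HBM2 rate are assumptions picked for illustration; nothing is known about what Apple would ship.

```python
def aggregate_bandwidth_gbs(channels: int, width_bits: int, mts: float) -> float:
    """Peak bandwidth in GB/s for `channels` channels of `width_bits` each at `mts` MT/s."""
    bytes_per_transfer = channels * width_bits / 8
    return bytes_per_transfer * mts * 1e6 / 1e9

def required_mts(target_gbs: float, channels: int, width_bits: int) -> float:
    """Per-pin data rate (MT/s) needed to reach `target_gbs` on the given bus."""
    bytes_per_transfer = channels * width_bits / 8
    return target_gbs * 1e9 / bytes_per_transfer / 1e6

# Hypothetical 16-channel DDR5 Mac Pro at assumed speed grades.
for rate in (4800, 5600, 6400):
    print(f"16 x 64-bit DDR5-{rate}: {aggregate_bandwidth_gbs(16, 64, rate):.1f} GB/s")

# One 1024-bit HBM2 stack at an assumed 2000 MT/s per pin, for comparison.
print(f"1 x 1024-bit HBM2: {aggregate_bandwidth_gbs(1, 1024, 2000):.1f} GB/s")

print(f"rate needed for 800 GB/s: {required_mts(800, 16, 64):.0f} MT/s")
```

So the 800 GB/s ballpark only holds at the fast end of DDR5; slower grades land noticeably below it on the same 1024-bit bus.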

But there are of course a lot of "but's" and "how's" around all this stuff. Very curious how they end up doing it.

That could be how Apple replaces the models with dGPUs (16" MBP, iMac), but I don't think you'll see that type of setup in a MBA or sub-16" Pro. There's a reason dGPUs (even the ones used in Macs) are running GDDR instead of DDR, so I think that you'd still have the SoC with its CPU and GPU cores, then a dedicated GPU with its own memory that connects via a (likely proprietary) high-speed bus to the SoC.

As you probably already gathered, I am very skeptical when it comes to dGPUs on Apple Silicon platforms. A dGPU with its own dedicated memory would utterly shatter the elegant programming model Apple has built with their GPUs. As I've been arguing for a while, a huge advantage of Metal on Apple GPUs is that all of them offer the same base guarantees (unified memory, zero-copy buffers, low-latency heterogeneous data access). A traditional dGPU with its own memory will violate those guarantees, which means that developers will need to code separately for different GPU types in order to get the best performance.

What I can definitely see is that the GPU is on a separate die from the rest (and uses a high-speed bus to connect to the shared cache and memory controllers), essentially the chiplet approach that AMD currently uses. This would increase the latency a bit and make the system package larger, but would allow for better yields and more flexible chiplet configurations. However, this is not a dGPU in the traditional sense, as it doesn't have its own dedicated memory.
 
The 68000 Macs used to have proprietary bus slots (if memory serves) for such purposes, so Apple may go back to custom designs instead of using PCIe.

I am probably completely off tho.

Thoughts?
I’m gonna go out on a limb and say that Apple will use an evolution of the MPX slots, “compatible” with PCI but with an extra slot for Apple’s own modules
 
Doubling the memory bus width would also double the memory bandwidth to 136 GB/s, assuming the memory technology used is still LPDDR4X. It'll be interesting to see if doing that alone will double the M1's thruput for all processing cores (i.e. CPUs, GPUs, NPUs, etc.)

Tests conducted by Anandtech on the M1 Macs shows that a single Firestorm core is enough to saturate the 68 GB/s bandwidth. So it would seem that the M1 Macs are severely bandwidth constrained with more potential yet to be unleashed?

With UMA, I would think that the M1's system interconnect fabric would have to implement some sort of fair share algorithm for each of the processing cores to prevent data starvation. So the 68 GB/s bandwidth provided by the 128-bit LPDDR4X memory could not be allocated fully to any single processing core's use.

According to Apple, the M1's GPU could perform 2.6 TFLOPS (presumably FP32). From my limited understanding, 68 GB/s is nowhere near enough to keep the M1 7/8 GPU cores fed to achieve 2.6 TFLOPS.

For iMacs and Mac Pros, I would think it's unlikely Apple will go with higher bandwidth memory, e.g. HBM2, as it'll be too cost prohibitive to implement for consumer products. What I think would be likely is that HBM2 or equivalent (costly) memory tech. will be use solely for the GPU, and DDR5/LPDDR5 will be used for main memory, with the GPU sitting on a separate die/board with it's own memory, but with custom circuitry to ensure memory coherency with main memory so as to preserve the UMA architecture. The 68000 Macs used to have proprietary bus slots (if memory serves) for such purposes, so Apple may go back to custom designs instead of using PCIe.

I am probably completely off tho.

Thoughts?
Interesting idea about memory coherency!
 
AMD does something like this for their hUMA (heterogeneous Uniform Memory Access). hUMA | Tom's Hardware
I'm really interested to see what solution Apple comes up with to "replace" the traditional dGPU.

It may well end up being something on a separate die (either on the SoC or off it) with a proprietary interconnect. I expect that something like hUMA or other shared memory solution will be used rather than going back to dedicated GPU VRAM.

GPU performance is actually the deciding factor for me upgrading from my current Intel MBP16. I really wanted a 13-14" screen, but at the time, the MBP13 with Intel iGPU just wasn't good enough for video editing. If a new MBP14 can exceed the current AMD 5600M in the BTO MBP16, then I would bite. This requires a roughly 2x improvement in current M1 GPU performance.
 
Remember that the DDR5 standard incorporates ECC, which is relevant given the speeds that current top tier memory reaches.
 
Whatever they do, they will keep the unified memory between the CPU and GPU. I do wonder if Apple will allow for additional GPUs on top of this though - but I’m not educated enough in this to know how that would work with the unified memory architecture.

Does anybody know if they could implement a system where you could add additional GPUs via something like a PCIe slot, but still have it connected to the unified memory?
 
I'm really interested to see what solution Apple comes up with to "replace" the traditional dGPU.

It may well end-up being something on a separate die (either on the SoC or off it) with a proprietary interconnect. I expect that something like hUMA or other shared memory solution will be used rather than going back to dedicated GPU VRAM.

GPU performance is actually the deciding factor for me upgrading from my current Intel MBP16. I really wanted a 13-14" screen, but at the time, the MBP13 with Intel iGPU just wasn't good enough for video editing. If a new MBP14 can exceed the current AMD 5600M in the BTO MBP16, then I would bite. This requires a roughly 2x improvement in current M1 GPU performance.
Considering that the GPU in the M1 is almost the same speed as a 5500M then double the number of GPU cores should easily top the 5600M.
 
I expect that something like hUMA or other shared memory solution will be used rather than going back to dedicated GPU VRAM.

Cache-coherent approaches like hUMA (which if I understand correctly are present in all modern dGPU systems) are using dedicated GPU RAM - there is just additional machinery in place to make CPU/GPU memory synchronization easier from the programmer's point of view. The data still has to be copied over the PCIe bus, so latency is terrible. I don’t see Apple using this inferior technology when they have been promising true unified memory.

GPU performance is actually the deciding factor for me upgrading from my current Intel MBP16. I really wanted a 13-14" screen, but at the time, the MBP13 with Intel iGPU just wasn't good enough for video editing. If a new MBP14 can exceed the current AMD 5600M in the BTO MBP16, then I would bite. This requires a roughly 2x improvement in current M1 GPU performance.

Apple Silicon has an inherent advantage for video editing exactly because of unified memory. The end result is a slower GPU outperforming a faster dedicated one in real world task since it can achieve much better resource utilization.


Considering that the GPU in the M1 is almost the same speed as a 5500M then double the number of GPU cores should easily top the 5600M.

M1 is not anywhere near 5500M - it’s about 30-50% slower. The 16-core variant should match or outperform the 5500M though, which would be a great performance level for a 14” laptop. To get to 5600M levels Apple will need around 32 cores and some serious memory bandwidth.
 
M1 is not anywhere near 5500M - it’s about 30-50% slower. The 16-core variant should match or outperform the 5500M though, which would be a great performance level for a 14” laptop. To get to 5600M levels Apple will need around 32 cores and some serious memory bandwidth.
Do they really need up to 32 cores? Thought it'd be in the ballpark with between 16-20?
 
Do they really need up to 32 cores? Thought it'd be in the ballpark with between 16-20?

Well, a single Apple GPU core is roughly the same as 2 AMD Navi CUs (both contain 4 32-wide ALUs). To match the 5500M (24 CUs) "on paper", we'd need 12 Apple cores running at the same frequency (1300 MHz); to match the 5600M (40 CUs), we'd need 20 Apple cores running at the same frequency (1035 MHz). The GPU in the M1 seems to run at around 1270 MHz (two FLOPS per ALU per clock * frequency * 1024 ALUs = 2.6 TFLOPS), so very close to the 5500M already...

So yeah, you are right, based on this overly simplistic napkin math, a 20-core Apple GPU running at the same 1270Mhz should theoretically be able to match the 5600M in performance, if the rest scales linearly.
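Here is that napkin math as a few lines of Python, in case anyone wants to plug in other clocks. The ALU counts per core/CU and the "two FLOPs per ALU per clock" factor come from the post above; the 40-CU figure for the 5600M is AMD's published spec. Function names are made up for illustration, and this is peak-on-paper only, with no claim about real-world scaling.

```python
ALUS_PER_APPLE_CORE = 128    # 4 x 32-wide ALUs per Apple GPU core
ALUS_PER_NAVI_CU = 64        # so 2 Navi CUs ~ 1 Apple core
FLOPS_PER_ALU_PER_CLOCK = 2  # one fused multiply-add counts as two FLOPs

def peak_tflops(alus: int, clock_mhz: int) -> float:
    """Theoretical peak FP32 throughput in TFLOPS."""
    return FLOPS_PER_ALU_PER_CLOCK * alus * clock_mhz * 1e6 / 1e12

def apple_cores_to_match(navi_cus: int, navi_clock_mhz: int, apple_clock_mhz: int) -> int:
    """Apple cores needed to reach the Navi part's on-paper peak (rounded up)."""
    target = navi_cus * ALUS_PER_NAVI_CU * navi_clock_mhz   # in ALU-MHz
    per_core = ALUS_PER_APPLE_CORE * apple_clock_mhz
    return -(-target // per_core)                           # integer ceiling division

print(f"M1 (8 cores @ 1270 MHz): {peak_tflops(8 * ALUS_PER_APPLE_CORE, 1270):.2f} TFLOPS")
print("5500M at equal clocks:", apple_cores_to_match(24, 1300, 1300), "cores")
print("5600M at equal clocks:", apple_cores_to_match(40, 1035, 1035), "cores")
print("5600M vs Apple @ 1270 MHz:", apple_cores_to_match(40, 1035, 1270), "cores")
```

At equal clocks this gives the 12 and 20 cores mentioned above; at the M1's ~1270 MHz, fewer than 20 cores would already match the 5600M on paper, so a 20-core part would sit somewhat above it.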

P.S. Looking at these numbers, I am again really impressed about how energy-efficient Apple GPUs are. 2560 ALUs = 25 watts (without memory of course). That's the same amount of ALUs as in a 215W Nvidia RTX 2070 Super. Of course, we need to keep in mind that the Nvidia GPU runs 40% faster clock and has limited superscalar capabilities (it can execute a FP32+INT operation per clock, Apple can't do that), so it's going to be considerably faster, but not 8 times faster... Apple has 3-4x advantage in performance per watt here.
 
Well, a single Apple GPU core is roughly the same as 2 AMD Navi CUs (both contain 4 32-wide ALUs). To match the 5500M (24 CUs) "on paper", we'd need 12 Apple cores running at the same frequency (1300 MHz); to match the 5600M (40 CUs), we'd need 20 Apple cores running at the same frequency (1035 MHz). The GPU in the M1 seems to run at around 1270 MHz (two FLOPS per ALU per clock * frequency * 1024 ALUs = 2.6 TFLOPS), so very close to the 5500M already...

So yeah, you are right, based on this overly simplistic napkin math, a 20-core Apple GPU running at the same 1270Mhz should theoretically be able to match the 5600M in performance, if the rest scales linearly.

P.S. Looking at these numbers, I am again really impressed about how energy-efficient Apple GPUs are. 2560 ALUs = 25 watts (without memory of course). That's the same amount of ALUs as in a 215W Nvidia RTX 2070 Super. Of course, we need to keep in mind that the Nvidia GPU runs 40% faster clock and has limited superscalar capabilities (it can execute a FP32+INT operation per clock, Apple can't do that), so it's going to be considerably faster, but not 8 times faster... Apple has 3-4x advantage in performance per watt here.
That's rather unrealistic, though, since a single 5600m comes with dedicated 8GB of ~400GB/s RAM vs ~60GB/s for a whole M1 SoC. Even doubling the RAM channels and assigning the entirety of the SoC bandwidth to the GPU only brings it up to about a quarter of a single 5600m. Regardless of how fast each M1X GPU core is (and M1 GPU cores are really not very fast, but rather power efficient), it can't do much when data starved.

I'm also really surprised with the dishonest and misleading desktop RTX 2070 comparison, since your posts tend to be technical and of high quality. If you already want to compare M1 GPU to RTX 2070 then 2070 Super Max-Q is more fair, with peak power of 80-90W for the entire card and a multiple of M1 GPU performance. Of course a significant portion of this power, possibly even half, goes to 8GB VRAM, its controllers and buses, and even more goes to power significant parts of the chip that support functionality that doesn't even exist on an M1 GPU. 2070 is also built on an ancient 12nm process vs the bleeding edge 5nm for the M1.
 
I'm also really surprised with the dishonest and misleading desktop RTX 2070 comparison, since your posts tend to be technical and of high quality. If you already want to compare M1 GPU to RTX 2070 then 2070 super max-q is more fair, with peak power of 80-90W for the entire card and a multiple of M1 GPU performance. Of course a significant portion of this power, possibly even half, goes to 8GB VRAM, it's controllers and busses, and even more goes to power significant parts of the chip that support functionality that doesn't even exist on a M1 GPU. 2070 is also built on an ancient 12nm process vs the bleeding edge 5nm for the M1.

It's a fair point, so let me explain why I chose that particular comparison. It was not my intention to be misleading or to downplay the Nvidia GPU; it is obviously going to be much faster in a real world scenario (as I did point out in my post). I picked the RTX 2070 Super because it has 2560 ALUs (what Nvidia calls "CUDA cores"), exactly the same amount as a hypothetical 20-core Apple GPU (with 4x32 ALUs per core). My intention was to look at different designs with comparable compute cluster configurations at different power consumption levels and in different frequency ranges. You are right that I should have picked the Max-Q laptop variant, since it has clocks that make it very comparable to Apple Silicon. Looking at these side by side (all of them have 2560 ALUs):

2070 Super (1770 Mhz, 215 W, 9TFLOPS) = 23.8 Watts/TFLOP
2070 Super Max-Q (1290 Mhz, 90W, ~6.6TFLOPS) = 13.6 Watts/TFLOP
Radeon Pro 5600M (1030 Mhz, 50W, 5.3TFLOPS) = 9.4 Watts/TFLOP
20-core Apple GPU (1270Mhz, 25W + 10W RAM, ~6.5TFLOPS) = 5.4 Watts/TFLOP

To be more "fair" I have added 10 watts to our hypothetical Apple GPU to account for RAM power usage (it's likely to be a very conservative estimate, since M1 RAM draws less than 1 Watt in GPU-related benchmarks and less than 1.5 Watts in memory-intensive tests).
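The Watts/TFLOP column can be reproduced from the clocks and power figures listed above. This is only a sketch: it computes peak TFLOPS from 2 FLOPs x 2560 ALUs x clock rather than using the rounded TFLOPS values in the list, so a couple of results differ from the list in the first decimal. The power numbers are the ones given in the post, including the assumed 10 W for Apple's RAM.

```python
ALUS = 2560  # all four configurations have 2560 ALUs

def watts_per_tflop(clock_mhz: int, watts: float, alus: int = ALUS) -> float:
    """Board power divided by theoretical peak FP32 TFLOPS."""
    tflops = 2 * alus * clock_mhz * 1e6 / 1e12
    return watts / tflops

# (name, clock in MHz, power in W) as listed in the post
gpus = [
    ("2070 Super", 1770, 215),
    ("2070 Super Max-Q", 1290, 90),
    ("Radeon Pro 5600M", 1030, 50),
    ("20-core Apple GPU", 1270, 25 + 10),  # hypothetical part, +10 W assumed for RAM
]

for name, mhz, watts in gpus:
    print(f"{name:18s} {watts_per_tflop(mhz, watts):5.1f} W/TFLOP")
```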

What we see here is that Apple's design is capable of reaching the same peak FLOPS at 60% of Navi's power consumption or 40% of Turing's power consumption at the same clock speed. There is no doubt that the superior process node plays a role here, but I don't think that it explains all of the difference we see here. For instance, Ampere is manufactured on a newer node (Samsung 8nm for the consumer parts), but it's only 20-30% faster than Turing at the same power usage, and that includes all the microarchitectural improvements like improved superscalar execution etc... Even Navi at 7nm, with highly binned 5600M parts and very power-efficient HBM2, is only 30% more efficient than the Max-Q Turing on the 12nm node... No, I think a substantial portion of the difference here is the GPU microarchitecture itself. It seems to me that Apple is using a very streamlined approach to GPU design, with their cores ending up smaller, simpler, and more efficient overall, even if they trade some functionality for it. I think Apple GPUs are really underestimated, as people traditionally tend to brush mobile GPUs aside as underpowered and boring.

Now, please keep in mind that we are looking at TDP vs. theoretical peak FLOPS! Real world is different and that's where other factors come into play. Apple ALUs are much simpler than the competitors', and Apple Silicon (at least what we've seen before) doesn't have the bandwidth to hold its own against dGPUs, so dedicated GPUs with similar on-paper performance are ending up considerably faster in many synthetic compute benchmarks. I would really like to see what Apple Silicon can do if it has 300-400 GB/s of available memory bandwidth...

On the other hand, Apple GPUs don't need nearly as much bandwidth for rasterization, so they really punch above their weight class in gaming benchmarks. The 2070 Super Max-Q scores around 43246 in Wild Life, M1 with 8 GPU cores scores around 16K if I remember correctly — our 20-core 25W GPU would translate to somewhere around very respectable 35-40k.

That's rather unrealistic, though, since a single 5600m comes with dedicated 8GB of ~400GB/s RAM vs ~60GB/s for a whole M1 SoC. Even doubling the RAM channels and assigning the entirety of the SoC bandwidth to the GPU only brings it up to about a quarter of a single 5600m. Regardless of how fast each M1X GPU core is (and M1 GPU cores are really not very fast, but rather power efficient), it can't do much when data starved.

True, but I think it's obvious that any high-end GPU system Apple ships will need to have appropriate memory bandwidth available to it. The rumored M1X should be fine with 2x of M1's bandwidth (or maybe even 1.5x if they use LPDDR5 instead). To play with the big boys Apple would need wider memory. I am fairly confident that 200 GB/s should be enough to reach 5600M/2060 Max-Q rasterization performance levels due to savings from TBDR. Compute might be a different story, but then again I have no idea how all these things interact in the real world. Synthetic benchmarks will obviously fly on a GPU with very fast dedicated RAM, but do they take the time needed to copy the data to the GPU into account? Similarly, for many productivity workloads the elimination of PCIe data transfer can more than offset the loss of some RAM bandwidth, especially in a heterogeneous processing environment (data moving between the CPU, GPU and ML coprocessors).
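To make the copy-overhead argument a bit more concrete, here is a deliberately crude model. Everything in it is an assumption for illustration: the workload is purely bandwidth-bound, the dGPU copies the data in and the same amount back over PCIe 3.0 x16 (about 15.75 GB/s theoretical per direction), copies do not overlap with compute, and the 400 GB/s VRAM and 200 GB/s unified figures are just the round numbers from this discussion.

```python
PCIE3_X16_GBS = 15.75  # assumed: theoretical PCIe 3.0 x16 bandwidth, one direction

def dgpu_time_s(data_gb: float, passes: int, vram_gbs: float,
                pcie_gbs: float = PCIE3_X16_GBS) -> float:
    """Copy data in, sweep over it `passes` times in VRAM, copy the result back."""
    copy = 2 * data_gb / pcie_gbs
    compute = passes * data_gb / vram_gbs
    return copy + compute

def unified_time_s(data_gb: float, passes: int, mem_gbs: float) -> float:
    """Same sweeps, but directly over shared memory with no copies."""
    return passes * data_gb / mem_gbs

def break_even_passes(vram_gbs: float, mem_gbs: float,
                      pcie_gbs: float = PCIE3_X16_GBS) -> float:
    """Number of sweeps at which the faster VRAM pays back the PCIe round trip."""
    return (2 / pcie_gbs) / (1 / mem_gbs - 1 / vram_gbs)

for passes in (10, 100):
    d = dgpu_time_s(1.0, passes, vram_gbs=400)
    u = unified_time_s(1.0, passes, mem_gbs=200)
    print(f"{passes:3d} passes over 1 GB: dGPU {d:.3f} s, unified {u:.3f} s")

print(f"break-even: ~{break_even_passes(400, 200):.0f} passes")
```

Under these assumptions the unified design wins whenever the data is touched only a handful of times per round trip, and the dGPU wins once the same data is reused heavily on the card. Real workloads overlap transfers with compute and are not purely bandwidth-bound, so treat this as a way to reason about the trade-off, not as a prediction.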
 