I worded that sentence very poorly. Isn't the difference between a TBDR GPU and an IM GPU that the TBDR GPU splits the workload that an IM GPU does into smaller chunks? If so, how could an IM GPU require more power if they both do the same work?

Not quite right. A TBDR GPU first rasterises all primitives that overlap with a given tile and constructs a visibility map. Very simplified, you can imagine it as storing the primitive that is visible in a pixel for each single pixel. E.g. if the tile is 16x16 pixels you will have a 16x16 grid (256 cells) where each cell points to the primitive visible in it. After all primitives have been rasterised and the visibility map contains the final information, the entire tile is shaded at once in parallel. This is basically dispatching a 16x16 compute program, where the program will fetch the primitive information, compute the per-point data (color, coordinates, texture info) and do whatever you need to do. The important thing though is that only visible pixels are ever shaded and that you shade the entire 16x16 grid of pixels at once. This is obviously good for hardware as you never shade occluded pixels, and you can use the data-parallel nature of GPU SIMD more efficiently. Another important consequence of this is that your memory requests are likely to be cache-aligned.
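If it helps, here is a minimal C++ sketch of that visibility-map idea, under the simplifying assumptions above (16x16 tile, one depth value and one primitive id per pixel); it is an illustration of the concept, not any vendor's actual hardware pipeline:

```cpp
// Illustrative sketch only: a 16x16 tile keeps, per pixel, the depth and the
// id of the frontmost primitive seen so far. Shading is deferred until every
// primitive overlapping the tile has been rasterised.
#include <array>
#include <cstdint>
#include <limits>

constexpr int kTileSize = 16;

struct TileVisibility {
    std::array<float, kTileSize * kTileSize>   depth;  // per-pixel depth
    std::array<int32_t, kTileSize * kTileSize> prim;   // frontmost primitive id, -1 = none

    TileVisibility() {
        depth.fill(std::numeric_limits<float>::infinity());
        prim.fill(-1);
    }

    // Called while rasterising each primitive that covers pixel (x, y).
    // Only the visibility map is updated; no shading happens here.
    void rasterise(int x, int y, float z, int32_t primId) {
        int idx = y * kTileSize + x;
        if (z < depth[idx]) {          // depth test inside on-chip tile memory
            depth[idx] = z;
            prim[idx]  = primId;
        }
    }
};

// After all primitives are in, the whole tile is shaded in one pass,
// conceptually like a 16x16 compute dispatch: one shade per visible pixel.
template <typename ShadeFn>
void shadeTile(const TileVisibility& tile, ShadeFn shade) {
    for (int idx = 0; idx < kTileSize * kTileSize; ++idx)
        if (tile.prim[idx] >= 0)
            shade(idx, tile.prim[idx]);   // occluded fragments never reach here
}
```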

An IM GPU instead shades pixels directly during rasterization (which is why it's called "immediate mode"). This means that you might end up shading occluded pixels (which cannot happen with TBDR). Also, since the IM shader will only ever operate on pixels from one triangle (unlike TBDR where different pixels can come from different triangles), it has efficiency problems at triangle edges where some of the pixels have to be skipped. Furthermore, triangles are not guaranteed to occur close to each other, so you might be shading pixels in different areas of the screen, which will thrash your caches.

At least this is the theory. The reality is more complicated. Modern IM GPUs do use tiling to improve cache locality, and modern rendering techniques already minimise overdraw. Modern games increasingly spend more time doing complex post-processing which requires a lot of bandwidth, and now we even have techniques that skip fixed-function rasterisation completely and rasterise triangles inside compute shaders. So there are fewer and fewer opportunities for TBDR to play to its advantage, and it's not like the TBDR approach is free either (the processed primitives have to be stored somewhere prior to shading, the entire thing doesn't work with transparency, and getting TBDR to work in the first place is insanely complicated). At the same time, Apple's proprietary extensions combine the tiling functionality with compute shaders, which allows many algorithms to be executed more efficiently, dramatically reducing the need for memory bandwidth. But this of course requires additional effort...
 
An IM GPU instead shades pixels directly during rasterization (which is why it's called "immediate mode"). This means that you might end up shading occluded pixels (which cannot happen with TBDR).
How can this be? IM GPUs also use z-buffered rasterization.

A TBDR GPU first rasterises all primitives that overlap with a given tile and constructs a visibility map. Very simplified, you can imagine it as storing the primitive that is visible in a pixel for each single pixel. E.g. if the tile is 16x16 pixels you will have a 16x16 grid (256 cells) where each cell points to the primitive visible in it. After all primitives have been rasterised and the visibility map contains the final information, the entire tile is shaded at once in parallel. This is basically dispatching a 16x16 compute program, where the program will fetch the primitive information, compute the per-point data (color, coordinates, texture info) and do whatever you need to do. The important thing though is that only visible pixels are ever shaded and that you shade the entire 16x16 grid of pixels at once.
I still don't understand how a TBDR GPU can minimize overdrawing better than an IM GPU if both types of GPUs use the same algorithm to draw pixels.

By the way, Apple explains how Apple GPUs work in this presentation.
 
How can this be? IM GPUs also use z-buffered rasterization.

I still don't understand how a TBDR GPU can minimize overdrawing better than an IM GPU if both types of GPUs use the same algorithm to draw pixels.

Imagine two full screen triangles, A and B, drawn in this order. Assume that A is completely occluded by B. In other words, only B is visible.

An IM GPU will rasterize and shade A first. Since there is nothing on the screen yet (z buffer is empty), every single pixel of A will be shaded and stored. B is rasterized and shaded second. Since every pixel of B passes z test (as it is in front of A), every pixel will be shaded and stored. Result: every triangle is rasterized once, every pixel is shaded and emitted twice.

A TBDR GPU will first rasterize A. Since every pixel of A passes the z test, the tile visibility map cells will point to A. No shading is done yet. B is rasterized next; since every pixel passes the z test, the tile visibility map cells are updated and now point to B. Note that updates to this map are very efficient as they do not leave the GPU core memory. Now that both triangles are rasterized, the tile is shaded at once, using the information about visible pixels. Result: every triangle is rasterized once, every pixel is shaded and emitted once.
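A tiny counting sketch of that exact scenario (illustrative C++ only; the 1920x1080 resolution is an assumed example):

```cpp
// Two full-screen triangles, A drawn first, then B on top. The IM path shades
// while it rasterises; the TBDR path only shades what survives in the
// visibility map.
#include <cstdio>

int main() {
    const long pixels = 1920L * 1080L;   // assumed full-screen resolution
    const int  triangles = 2;            // A first, then B completely covering it

    // Immediate mode: A's fragments all pass the (empty) z-buffer and are
    // shaded; B's fragments also all pass and are shaded again.
    long imShades = pixels * triangles;

    // TBDR: both triangles only update the tile visibility map; each pixel is
    // shaded exactly once, from whichever primitive ended up visible (B).
    long tbdrShades = pixels * 1;

    std::printf("IM shades:   %ld\nTBDR shades: %ld\n", imShades, tbdrShades);
}
```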

Hope it’s a bit more clear now?

The common explanation of TBDR is that it "sorts triangles" and then only draws the visible ones. That would be fairly dumb, as sorting triangles is very expensive. But the end effect is the same. You can also imagine it as doing a depth-only rendering pass (a common optimization technique for IM GPUs back in the olden days), only doing it smartly.

P.S. One interesting consequence of TBDR is that there are never data races when shading pixels. With IMR there is always a risk of processing two triangles that share pixels in parallel. I have no idea how GPUs deal with it (there must be some sort of resolve mechanism), but I can only imagine that it's far from trivial. A TBDR GPU never has to worry about this stuff. Which also means that committing pixels is just a very efficient memory block copy, and that you can do fancy things like programmable blending or programmable MSAA very cheaply.
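To make the last point a bit more concrete, here is a minimal C++ sketch of the idea, assuming the whole tile lives in on-chip memory; the struct and function names are invented for illustration, this is not Metal's or any vendor's actual API:

```cpp
// Illustrative sketch: with the whole tile resident in on-chip memory,
// arbitrary ("programmable") blending is just ordinary reads/writes of that
// local buffer, and committing the finished tile to the framebuffer is a
// plain contiguous block copy per row.
#include <cstring>
#include <cstdint>

constexpr int kTileSize = 16;

struct RGBA { uint8_t r, g, b, a; };

struct TileColorBuffer {
    RGBA pixels[kTileSize * kTileSize];   // lives in fast tile-local storage

    // Any blend function can read the current tile pixel directly; no
    // read-modify-write round trip to DRAM is needed.
    template <typename BlendFn>
    void blend(int idx, RGBA src, BlendFn blendFn) {
        pixels[idx] = blendFn(src, pixels[idx]);
    }

    // Commit: copy the finished tile into the framebuffer, one contiguous
    // row at a time (the framebuffer is laid out as full scanlines).
    void commit(RGBA* framebuffer, int fbWidth, int tileX, int tileY) const {
        for (int row = 0; row < kTileSize; ++row) {
            const RGBA* srcRow = &pixels[row * kTileSize];
            RGBA* dstRow = &framebuffer[(tileY + row) * fbWidth + tileX];
            std::memcpy(dstRow, srcRow, kTileSize * sizeof(RGBA));
        }
    }
};
```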
 
If you need that much video RAM and you have money to spare, all it takes is connecting multiple GPUs together, which you CAN do in the PC platform.
That strategy does not work for heavy 3D work because the scene data has to be loaded into both GPUs at the same time. So if you have different GPUs, the one with the least VRAM is going to be the limit if you try to render across multiple GPUs. The maximum VRAM on a single Nvidia card is 48GB, last I checked.

So even though I have a W6800X Duo, I am still going to be limited to 32GB of VRAM for GPU rendering using both GPUs. And I have confirmed via Octane that if I stick my old GPU (580X) back in and render using all three, it is actually slower than just using the W6800X Duo.
 
If you need that much video RAM and you have money to spare, all it takes is connecting multiple GPUs together, which you CAN do in the PC platform.
That does not yield a non-compute GPU with 70 GB of RAM that's also attached to the CPU. It gives you the same split between CPU RAM and GPU RAM that most applications are coded around today.
 
If Apple wanted to make the most powerful PC they could, then they would do this. The thing is, Apple wants to make a moderately powerful PC while also making themselves a lot of money/profit.

Mac Pro will no doubt be the most powerful Mac ever made but will it be good value or as powerful as other systems? No. If someone needs the power of Epyc cpus and RTX 4000 cards they won't want a Mac. RTX 4000 cards are aimed at gamers. Nvidia has their own line up of GPUs for professional/studio/AI/enterprise work. You wouldn't pair an Epyc with a 4000 card in the long term.

They aren't waiting on a Mac Pro lol.
 
If you need that much video RAM and you have money to spare, all it takes is connecting multiple GPUs together, which you CAN do in the PC platform.
Someone else already explained that this doesn't work. But even if that issue weren't relevant - and it may not be if you're running your own code - you would still face the issues of latency and bandwidth between the multiple GPUs (and the CPU, if that's involved). Nvidia has a custom direct-link solution because PCIe isn't even close to good enough for this. But for problems with poor locality, even that is drastically inferior to having a true single large pool of memory.

Of course the M1s aren't *exactly* a single large pool of memory. The Ultra, in particular, has that insane 20tbps interconnect precisely because you've got two whole separate pools (which themselves are somewhat divisible). But they're closer than anything else, at least for now.
 
For the Mac Pro, why not simply put a 96-core AMD EPYC CPU in it with an RTX 4090 (if Apple can solve their politics with NVIDIA), while retaining user expandability and repairability for the Mac Pro?

1. macOS is limited to 64 threads. Any x86 processor with SMT (Hyper-Threading) and more than 32 cores is relatively ineffective for macOS. Up to 64 cores, Apple could probably use firmware settings to permanently fuse off the SMT functionality, but performance on a wide variety of workloads isn't going to do all that well.
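To illustrate the kind of ceiling being described, here is a minimal C++ sketch, assuming the kernel tracks processors with a single 64-bit mask; this is a generic illustration of why such a limit exists, not a description of XNU's actual internals:

```cpp
// Illustrative only: if a kernel tracks processors with a single 64-bit mask
// (one bit per logical CPU), CPU 64 and above simply cannot be represented.
// A 48-core SMT-enabled x86 part already presents 96 logical CPUs, which
// would not fit.
#include <cstdint>
#include <stdexcept>

using cpu_mask_t = uint64_t;             // one bit per logical CPU

constexpr int kMaxCpus = 64;             // 8 * sizeof(cpu_mask_t)

cpu_mask_t set_cpu(cpu_mask_t mask, int cpu) {
    if (cpu >= kMaxCpus)
        throw std::out_of_range("logical CPU id does not fit in a 64-bit mask");
    return mask | (cpu_mask_t{1} << cpu);
}
```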

Your question is off base. The "have to build a clone of AMD/Intel server chips" framing is way off in the weeds: why would Apple build a CPU package that doesn't match their own operating system? Apple is not likely at all to run off and spend giant sums of money making other people's operating systems better. The M-series has an 'M' primarily because it is supposed to make macOS (and iPadOS) run better. That is the primary objective.


ServeTheHome is using dual Epyc/Xeon SP systems in the benchmark on this:

"...
Still, we wanted to show why acceleration matters in a use case that was pertinent to us. As a result, we bootstrapped the nginx QAT acceleration to the 32-core 8462Ys, and then ran the full STH nginx stack with the database (and minus back-end tasks like backups/ replication and such) all on a single node and compared it to the AMD EPYC 9374F. Here is what we saw:


[Chart: Intel vs AMD at 32 cores, STH nginx stack performance, QAT impact]



... "



Basically, accelerators can matter as much as core count. QAT off: AMD wins. QAT on: Intel wins.
The accelerators that Apple attaches to the SoC matter. It is not always just simply a matter of "most cores wins".

Are the new SP gen 4 accelerators going to completely stop the server market share bleed from Intel to Epyc Zen 4? No. Will they significantly slow the bleed down in some key submarkets if Intel ships enough volume at a quick pace (and doesn't get too greedy on price)? Probably yes.

Apple isn't trying to place a SoC in the generic server market. They are trying to place something in the single-user workstation market. If the accelerators match the user's workload, then you are completely "missing the boat" by myopically looking only at CPU core counts. There is a reasonably high probability that Apple will selectively add high-density logic accelerators in some key areas to offset the pressure to add more than 64 CPU cores to their systems. In fact, it wouldn't be surprising if Apple stayed below that count for several generations.

Everything "graphical output" that some folks try to load up CPU cores with, Apple will probably try pretty hard to push over the GPU cores ( where the macOS/Unix thread limit doesn't matter). AI/ML inference workloads ... AMX and NPU cores ... again where the macOS thread limit doesn't matter. Video de/encoding ... fix function; no macOS thread limits . Imaging analysis calculations ... to fixed function ; where no macOS thread thread limits.

There likely will not be very high overlap with the Intel Xeon SP accelerators. Doubtful Apple is going to do triple backflips to make Ethernet data traffic go faster. Database query accelerations ... probably not. I think Apple has some compression accelerators (besides video) already so there is some overlap. Encryption also.


But the general trend line is to put multiple dies into one package/socket. AMD and Intel are already well on that path. Apple has a "too chunky" chiplet in the M1 generation, but that probably isn't a permanent miscue.

Since Mac Pro usually supports dual chips, Apple could even put a 192-core AMD CPU in it even.

The Mac Pro hasn't gotten 'new' support for dual CPU packages since 2009. [The 2010 and 2012 models were largely just rehashes of the same logic board with minor differences.] So from 2006-2009 it supported two packages, for 3 years. From 2013-2022 it supported just one package, for 9 years. So the "usually supports" doesn't really hold. Even if you throw in the stale bone of 2010-2013, dual-package support still covers less than half of the Mac Pro era. (If you try to go back and drag in the PowerMac systems, you're digging an even deeper hole there.)


Even in the server space, multi-package setups are dying.

A Dell product manager, writing as a 'guest' author at The Next Platform in 2019 (sales numbers since have only confirmed this):
"... Dual socket servers are creating performance challenges due to Amdahl’s law – that little law that says the serial part of your problem will limit performance scaling. You see, as we moved to the many core era, that little NUMA link (non-uniform memory access) between the sockets has become a huge bottle neck – ... "

similar notion, same author later that year.
https://www.dell.com/en-us/blog/4-new-reasons-to-consider-1-socket-server/
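As a quick numeric illustration of that Amdahl's law point (a hedged sketch; the 95% parallel fraction is just an assumed example value):

```cpp
// speedup(n) = 1 / ((1 - p) + p / n), where p is the parallel fraction.
#include <cstdio>

double amdahl(double parallelFraction, int n) {
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / n);
}

int main() {
    const double p = 0.95;   // assume 95% of the work parallelises cleanly
    for (int n : {16, 32, 64, 128, 192})
        std::printf("%3d cores -> %.1fx speedup\n", n, amdahl(p, n));
    // Output climbs from ~9.1x at 16 cores to only ~18.2x at 192 cores: the
    // serial 5% (made worse in practice by cross-socket NUMA latency) caps
    // scaling long before core count does.
}
```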

more recent article.
" ...
AWS is using Graviton and Graviton2 single-socket machines in its vast EC2 server fleet. Google, as we have previously reported, is rolling out its Tau instances on the Google Cloud based on a single-socket “Milan” Epyc 7003 server node to drive down costs while driving up performance. Significantly, Google says that a single-socket 60-core Epyc server can offer 56 percent higher performance and 42 percent better bang for the buck than Amazon Web Services can do with its single-socket Graviton2 nodes. ... "


Does Graviton 2/3 run 100% of all Amazon Web Services workloads? No. Do they run enough to pay for Graviton R&D and production? Yes. Running 100% of all the workloads is a 'fake' requirement. It is not required. The Mac Pro running 100% of all possible x86_64 workloads isn't a serious requirement either. Will Apple be able to sell everything to everybody? No. Is that an Apple requirement? No.

The next Mac Pro is extremely unlikely to support multiple CPU packages. Multiple chiplets in a single coherent package? Yes. Multiple full-blown packages? Probably not.


Does Apple really believe the M2 Extreme would beat a 192-core AMD CPU and a RTX 4090? Heck, you can probably put multiple RTX 4090 in the Mac Pro even (if Apple solves their politics with NVIDIA).

Wrong question. The real question is: does Apple absolutely need a server chip and a 4090 killer to create a reasonably good single-user workstation? No. The misdirection here is that Apple 'has to' have a 4090 killer. That would be a 'nice to have', but it isn't absolutely necessary. There will be tons of workstations sold even in the x86 space that don't have a 4090 in them. A decent number will, but a substantially larger number won't. In the single-user workstation space, Threadripper Pro will likely outsell Epyc.


For laptops, I get it. ARM offers nice battery life, but a Mac Pro has no battery life.

That is a goofy notion. Back to the 'why single-socket servers' article from 2019.

"... To fix this we need more pins and faster SERDES (PCIe Gen4/5, DDR5, Gen-Z), 1-socket enables us to make those pin trade-offs at the socket and system level. .."

Abnormally high power consumption is part of the reason Intel is 'stuck' with higher per-socket power problems. And yes, even datacenters have a power budget. The long-distance, ever-faster SERDES tend to soak up power, which leads to heat, which tends to migrate toward the CPU cores' logic. The thermal management system will at some point chop the frequency of the cores to try to get back within the thermal budget and... performance will go down.

For a single-user workstation, having the 12-, 24-, and 48-core SoCs all share the same, more than decently high, single-threaded performance is more a feature than a "problem". You shouldn't "have to" buy a lot more cores you don't need if single-threaded performance is all you're looking for. Likewise, you shouldn't have to start trading off single-threaded performance (or giant gobs of money) if you need a mix of single- and multi-threaded workloads.

Even if it is just the CPU cores, if they all generate lots of heat then you have the same issue the Mac Pro 2013 ran into with its single (jointly used) thermal transfer system.


The modern Arm instruction set has very little to do with "battery life". The Ampere Computing server processor has stats of:

"... Ampere Altra featuring 80 cores fabricated on TSMC's N7 process for hyperscale computing.[32][33][34] It was the first server-grade processor to include 80 cores and the Q80-30 conserves power by running at 161 W in use.[32] The cores are semi-custom Arm Neoverse N1 cores with Ampere modifications.[35] It supports a frequency of up to 3.3 GHz with TDP of 250 W, 8ch 72-bit DDR4, up to 4 TB DDR4-3200 per socket, 128x PCIe 4.0 Lanes, 1 MB L2 per core and 32 MB SLC.[33][34] ..:"

250W is not going into anyone's reasonably portable laptop. [Both Intel and AMD have taken to slapping a laptop label on a desktop SoC for their super-performance laptop chips meant to be paired with an equally hot discrete GPU. So there is a whole class of 'plug it into the wall the vast majority of the time' laptops now, and a 250W laptop CPU isn't 'outrageous' anymore. It should be, but it is not.] Neither is 4 TB of RAM. Neither is 128 PCIe 4.0 lanes.

Arm is less constipated than x86_64. They aren't trying to drag around every instruction to try to clone Multics and run 16-bit DOS into the 21st century. Arm is an instruction set that allows implementors to drop really old stuff. Apple habitually drops some really old stuff every 10-15 years, so it is a better match in philosophy.
 
Because it makes sense to focus on one architecture so that the software can be 100% optimized for it.
I agree, but you would think a company as large and with as deep pockets as Apple could afford to maintain quality software for multiple architectures… or just quality software.
 
1. macOS is limited to 64 threads. Any x86 processor with SMT (Hyper-Threading) and more than 32 cores is relatively ineffective for macOS. Up to 64 cores, Apple could probably use firmware settings to permanently fuse off the SMT functionality, but performance on a wide variety of workloads isn't going to do all that well.
Surely you don't think that that limitation would be challenging for Apple to remove, were they ever to produce a system that needed it? That's a very poor argument.

It would be crazy for them to do another x64 system this year like the one suggested, but I was very disappointed that the 2019 Pro didn't use AMD processors at least. (I was unsurprised by their continuing to shun nVidia.)

BTW, despite the poor opening argument and some language issues your post makes some good points. I agree that we're very unlikely to see a multiple-socket machine from Apple.
 
Surely you don't think that that limitation would be challenging for Apple to remove, were they ever to produce a system that needed it? That's a very poor argument.

They could remove it fairly easily, but there might be a performance hit on all hardware. Of course, with their current sealed volume technology etc they could install different versions of the kernel on different machines. That would be interesting.

The bigger question is whether they need more than 64 cores on Apple Silicon. It seems to me that 64 Apple cores would more than match a 96-core Genoa, with at least 30-40% lower power consumption.
 
They could remove it fairly easily, but there might be a performance hit on all hardware. Of course, with their current sealed volume technology etc they could install different versions of the kernel on different machines. That would be interesting.
There would definitely not be a performance hit on smaller hardware. As you say, they could use different kernels (though sealed volume tech is irrelevant to this), but that's unlikely to be the path they'd take. Simply detecting the number of cores available at early boot time and sizing various data structures to suit is neither a new idea, nor a challenging one.
The bigger question is whether they need more than 64 cores on Apple Silicon. It seems to me that 64 Apple cores would more than match a 96-core Genoa, with at least 30-40% lower power consumption.
Possibly... it would depend on the power and cooling envelopes of the systems, and most critically on what Apple did in the uncore. I've written about this here a few times - this is the big open question for Apple, when it comes to building bigger systems. I expect them to do a good job on this, despite the negative surprises in the M1 Max and especially Ultra, if they ever do build bigger systems. They took some big risks, some of which didn't pay off so well, or were stymied by other factors, but that was just the very first step. I think they'll get there.

The big question is, how do they organize such a big chip? Everyone focuses on core performance, but that's just half the story. Core chiplets hanging off an IO chiplet, the way AMD does it, would be a very drastic change from their current setup. It's certainly possible but they'd wind up discarding a significant part of many years of learning. And if they're not doing that, then what? There's only so much shoreline; you can't just widen memory and add more back-to-back Ultra-style links. At least not without some really serious new packaging engineering which isn't there yet.

The thing I'm most wondering is what do they do about caches in a larger system? IIRC, right now they're duplicating and carrying around tags for all L2 caches in every L2 cache. I am really skeptical about their prospects for trivially scaling this up to a really large number of cores. I can think of several ways they might deal with this, and they each have challenges. I also have no idea if they can scale up their NoC. And what about the SLC's role as a chipwide arbiter of memory access - will that scale? Many many questions, but no answers yet.
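For readers unfamiliar with the duplicate-tag idea, here is a rough C++ sketch of the generic technique (a snoop filter built from copies of every core's L2 tags); it is not Apple's actual design, and whether the copies live centrally or are replicated per cluster is exactly the kind of detail in question:

```cpp
// Generic duplicate-tag snoop filter sketch: a structure mirrors every core's
// L2 tags, so a miss can be resolved by a tag lookup instead of broadcasting
// a snoop to every core. The storage (and update traffic) grows with
// cores x L2 size, which is the scaling worry discussed above.
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

struct DuplicateTagDirectory {
    // dupTags[core] mirrors the set of cache-line tags held in that core's L2.
    std::vector<std::unordered_set<uint64_t>> dupTags;

    explicit DuplicateTagDirectory(int cores) : dupTags(cores) {}

    // Mirror an allocation/eviction in a core's private L2.
    void onFill(int core, uint64_t lineTag)  { dupTags[core].insert(lineTag); }
    void onEvict(int core, uint64_t lineTag) { dupTags[core].erase(lineTag); }

    // On a miss from `requester`, find (if any) another core that holds the
    // line, so only that core needs to be snooped.
    std::optional<int> findOwner(int requester, uint64_t lineTag) const {
        for (int c = 0; c < (int)dupTags.size(); ++c)
            if (c != requester && dupTags[c].count(lineTag))
                return c;
        return std::nullopt;
    }
};
```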
 
There would definitely not be a performance hit on smaller hardware. As you say, they could use different kernels (though sealed volume tech is irrelevant to this), but that's unlikely to be the path they'd take. Simply detecting the number of cores available at early boot time and sizing various data structures to suit is neither a new idea, nor a challenging one.

Dynamic specialisation in the kernel? Sounds messy to me... specialised kernels would probably work better. If I remember correctly, Linux does it by providing a compile-time constant for the CPU mask size. Apple could do something similar I suppose.
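Something along the lines of this C++ sketch, loosely modelled on the compile-time-constant approach mentioned above (Linux's NR_CPUS); the macro and type names here are invented for illustration:

```cpp
// Each specialised kernel build picks its own MAX_LOGICAL_CPUS and gets
// fixed-size, statically laid out per-CPU structures with no runtime
// indirection.
#include <array>
#include <cstdint>

#ifndef MAX_LOGICAL_CPUS
#define MAX_LOGICAL_CPUS 64        // e.g. 64 for today's machines, 256 for a big one
#endif

constexpr int kMaskWords = (MAX_LOGICAL_CPUS + 63) / 64;

struct CpuMask {
    std::array<uint64_t, kMaskWords> bits{};   // size fixed at compile time

    void set(int cpu)        { bits[cpu / 64] |= (uint64_t{1} << (cpu % 64)); }
    bool test(int cpu) const { return bits[cpu / 64] & (uint64_t{1} << (cpu % 64)); }
};
```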


The thing I'm most wondering is what do they do about caches in a larger system? IIRC, right now they're duplicating and carrying around tags for all L2 caches in every L2 cache. I am really skeptical about their prospects for trivially scaling this up to a really large number of cores. I can think of several ways they might deal with this, and they each have challenges. I also have no idea if they can scale up their NoC. And what about the SLC's role as a chipwide arbiter of memory access - will that scale? Many many questions, but no answers yet.

Do you have more specifics on this L2 tag duplication? I would be curious to read about it in more detail.

Reading name99's research, it seems that the scaling is approached in a very straightforward manner. The SLC is a memory-side cache, so it directly maps to a specific memory controller/section of physical RAM. Any memory request will go to a specific part of the SLC depending on the physical address, no need to bother with coherency or anything like that, as the content of the SLC on different chips never overlaps. The interconnect joins together the chip networks, allowing processors on one chip to make requests to the SLC/controllers of the other chip. Sorry if what I write here sounds naive, I am just an interested hobbyist in these things, I am sure that the details are much more complex and there are issues I am not seeing.
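A minimal C++ sketch of that address-to-slice mapping, with an assumed example layout (2 dies, 8 slices per die, 128-byte lines); the real hash is undoubtedly more sophisticated:

```cpp
// Bits of the physical address pick which SLC slice / memory controller a
// request goes to, so two slices never cache the same line and no cross-slice
// coherency is needed.
#include <cstdint>

struct SlcRoute {
    int die;     // which chip in the package owns this address
    int slice;   // which SLC slice / memory controller on that die
};

SlcRoute routeRequest(uint64_t physAddr) {
    uint64_t line = physAddr >> 7;                  // drop offset within the 128B line
    int slice = static_cast<int>(line % 8);         // interleave lines across slices
    int die   = static_cast<int>((line / 8) % 2);   // then across dies
    return {die, slice};
}
// Every requester applies the same function, so a given line always lands in
// exactly one slice; a request from the "wrong" die just travels across the
// inter-die link to that slice.
```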

From my naive perspective, it sounds like this approach should be scalable to a certain degree, as long as one can feasibly maintain high-performance intra-chip network extension bridges. The fact that the access is non-uniform can probably be effectively hidden (at some cost of latency), but shouldn't be too big of a deal. What's interesting is that Apple has some new patents that describe processor networks and communication protocols optimised for this kind of scenario. One of them seems to be describing overlapping network topologies of different types to facilitate efficient communication depending on the need (https://patents.google.com/patent/US20220334997A1) while another describes an efficient data transfer protocol for accessing remote memory controllers (https://patents.google.com/patent/US20220342588A1). The interesting bit about the latter patent is that they progressively drop address bits the closer the message gets to the respective destination. Could be a neat way to reduce the bandwidth requirement and energy cost of communicating over long distances.

P.S. Looking at the first mentioned patent again, this figure and quote might be of interest

Interface circuitry (e.g., serializer/deserializer (SERDES) circuits) [...] may be used to communicate across the die boundary to another die. Thus, the networks may be scalable to two or more semiconductor dies. For example, the two or more semiconductor dies may be configured as a single system in which the existence of multiple semiconductor dies is transparent to software executing on the single system. In an embodiment, the delays in a communication from die to die may be minimized, such that a die-to-die communication typically does not incur significant additional latency as compared to an intra-die communication.

 
Jack Dongarra gave a lecture on, among other things, the impact of memory bandwidth failing to catch up with computing capabilities in modern computers.
 
Dynamic specialisation in the kernel? Sounds messy to me... specialised kernels would probably work better. If I remember correctly, Linux does it by providing a compile-time constant for the CPU mask size. Apple could do something similar I suppose.
They could but it's likely not needed. It's not like kernel memory has to be so rigidly laid out - in fact, you don't want it to be; that's why there are things like KASLR. I can imagine a large-core-count kernel having some small disadvantage over a "normal" one, but the efficiency break would likely be at 256 CPUs unless they're carrying around some status bits along with the cpu mask.
Do you have more specifics on this L2, I would be curious to read about it in more detail.
The only thing I can point you to off the top of my head is name99's stuff, which I see you've absorbed. :)
Reading name99's research, it seems that the scaling is approached in a very straightforward manner. The SLC is a memory-side cache, so it directly maps to a specific memory controller/section of physical RAM. Any memory request will go to a specific part of the SLC depending on the physical address, no need to bother with coherency or anything like that, as the content of the SLC on different chips never overlaps. The interconnect joins together the chip networks, allowing processors on one chip to make requests to the SLC/controllers of the other chip. Sorry if what I write here sounds naive, I am just an interested hobbyist in these things, I am sure that the details are much more complex and there are issues I am not seeing.
It's not that it's naive, so much as you have to realize that the devil is in the details. For example, it's easy to say that the interconnect joins the chip networks, but *how* physically does it do that? (It's not like it's an Ethernet and you can just throw a switch in between the two... though that is one conceptual approach you can take.) And once you've solved that problem for an Ultra, two back-to-back Max chips, does that solution scale at all? It does not, not without at least some additional work, because you can't lay out more than two chips all with direct links to each other of that scale. Thus is born the "I/O die"... maybe.
From my naive perspective, it sounds like this approach should be scalable to a certain degree, as long as one can feasibly maintain high-performance intra-chip network extension bridges. The fact that the access is non-uniform can probably be effectively hidden (at some cost of latency), but shouldn't be too big of a deal. What's interesting is that Apple has some new patents that describe processor networks and communication protocols optimised for this kind of scenario. One of them seems to be describing overlapping network topologies of different types to facilitate efficient communication depending on the need (https://patents.google.com/patent/US20220334997A1) while another describes an efficient data transfer protocol for accessing remote memory controllers (https://patents.google.com/patent/US20220342588A1). The interesting bit about the latter patent is that they progressively drop address bits the closer the message gets to the respective destination. Could be a neat way to reduce the bandwidth requirement and energy cost of communicating over long distances.
Yes, it's not like any of this is unsolvable. There are, as I said earlier, multiple approaches they could take. But they all require serious engineering, careful tradeoffs, and great execution, if you want the result to be good. Think of the *hundreds* of patents (thousands?) that Apple has just to get their uncore to where it is today. And it isn't where it needs to be yet, at least at the high end. (The M2 may bode well for that high-end but it's just too small to really tell us for sure.)
 
And BTW, wow did the mods take a broom to this thread. A couple of my posts got caught up in the sweep. Just as well, I think nothing worthwhile got hit, and all that toxicity is gone.
 
For example, it's easy to say that the interconnect joins the chip networks, but *how* physically does it do that?

I suppose the main challenge is engineering a fast enough inter-chip connection, right? Or are there other more subtle challenges?
And once you've solved that problem for an Ultra, two back-to-back Max chips, does that solution scale at all? It does not, not without at least some additional work, because you can't lay out more than two chips all with direct links to each other of that scale. Thus is born the "I/O die"... maybe.

I can imagine a bunch of different topologies, all with different advantages and disadvantages to them (again, from my amateur perspective).

But I think the most interesting thing is that all these patents paint a clear picture of how Apple engineers seem to be approaching this problem. I mean, the concept of on-chip network extension is extremely obvious but also quite ambitious, as the off-die bandwidth required to do this seamlessly is tremendous. But I like how it completely avoids the problems of coherency and non-uniform access. I hope this will work out for them. The technology is definitely cool.
 
Yes, it's not like any of this is unsolvable. There are, as I said earlier, multiple approaches they could take. But they all require serious engineering, careful tradeoffs, and great execution, if you want the result to be good. Think of the *hundreds* of patents (thousands?) that Apple has just to get their uncore to where it is today. And it isn't where it needs to be yet, at least at the high end. (The M2 may bode well for that high-end but it's just too small to really tell us for sure.)
And on the software side, if there’s a particular compilation scheme that works most efficiently with the hardware methods chosen, they can ensure that’s what the compiler spits out.
 
It's not just raw computing power that pros need. It's also expansion. Many musicians and recording studios need PCIe expansion cards for multiple low-latency audio inputs. Others need Fibre Channel networking or additional I/O expansion. Others might need a card with legacy I/O for certain devices. Others might use PCIe slots for storage beyond Apple's internally provided storage, and a great many would like to add additional GPU cards to their computer as well.

Now many such cards might work fine in a PCIe enclosure running over Thunderbolt, but for many workflows that will add latency to an unacceptable degree. Having a computer with internal expansion allows it to become whatever the user needs it to be, especially if such a machine is kept and repurposed for another use after its primary use has been superseded by something newer.

Honestly, I hope the next Mac Pro keeps the same amount of PCI slots. When the 2019 Mac Pro came out, it felt a bit nostalgic as Apple hadn't made a machine with that many expansion slots since the good old Apple ][.
Thunderbolt is literally PCIe lanes, so there should be no perceptible latency unless you're measuring the speed of light or something. I can understand the pain of tons of enclosures though.

I'll eat my words if either of the big two GPU manufacturers ever produces a driver for Apple silicon. So GPUs are off the list. That's something people will continue to use a GPU cluster for (rendering or ML).

For audio and storage, Thunderbolt is as good as PCIe, minus the additional desk space. But you gain portability when you use Thunderbolt. So I see no upside to expandable Apple silicon in a Mac Pro-sized enclosure.
 
I suppose the main challenge is engineering a fast enough intra-chip connection, right? Or are there other more subtle challenges?
I believe from context you meant inter-chip connection, like the 20tbps link in the ultra - or at least inter-chiplet, in some hypothetical future Apple Mx design. And yes, it is more subtle.

Bandwidth is very important. But they've already shown some ridiculously high bandwidth in the Ultra. Latency, however, is arguably more important. (Equally critical, at least.) Dealing with that can be very challenging. And yes, sometimes you can "hide" it, but doing that takes engineering, and sometimes you just can't. Or you can but there are tradeoffs. Fundamental physics can be very unforgiving - wire length will impose speed-of-light (well, speed-of-electrons) limitations, for example.
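Some back-of-the-envelope numbers for that wire-length point (a rough sketch; 0.5c effective propagation speed and a 5 GHz clock are assumed example values):

```cpp
// Flight-time-only estimate for signals crossing a die or a big package.
#include <cstdio>

int main() {
    const double c   = 3.0e8;        // m/s, speed of light in vacuum
    const double v   = 0.5 * c;      // assumed effective signal speed in the wire
    const double clk = 5.0e9;        // assumed 5 GHz clock -> 0.2 ns per cycle

    for (double mm : {5.0, 20.0, 60.0}) {   // on-die vs across-package distances
        double oneWayNs = (mm / 1000.0) / v * 1e9;
        std::printf("%4.0f mm: %.2f ns one way (~%.1f cycles round trip)\n",
                    mm, oneWayNs, 2.0 * oneWayNs * clk / 1e9);
    }
    // Even before any routing, arbitration, or SERDES overhead, just the
    // flight time across a big package costs multiple clock cycles.
}
```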

Way more subtle than that are the architectural challenges. Just how many tags are you going to carry around duplicated in all your caches? Will you go to a hierarchy under multiple SLCs (this is sort of what you get with the Ultra)? Just how complicated can that get without increasing latency too much? (There's tons more like that, that was just the thing that sprang to mind first.)

I can imagine a bunch of different topologies, all with different advantages and disadvantages to them (again, from my amateur perspective).

But I think the most interesting thing is that all these patents paint a clear picture of how Apple engineers seem to be approaching this problem. I mean, the concept of on-chip network extension is extremely obvious but also quite ambitious, as the off-die bandwidth required to do this seamlessly is tremendous. But I like how it completely avoids the problems of coherency and non-uniform access. I hope this will work out for them. The technology is definitely cool.
It doesn't *really* eliminate the issue of NU(M)A. They may have things on the ultra fast enough that they can ignore the problem, at least in the first generation, but physics is physics. For a given core, some memory *is* slower than other memory, due to having to go across that InFO-LI link. Your choices are:
1) Ignore the issue. Plausible for a first gen. Implausible for Apple *for the long term*. They are relentless about being smart about design.
2) Slow more-local memory down to less-local memory speeds, so it's all really uniform. Worst choice, and I don't see them doing this.
3) Be smart, minimize nonlocality. Suck it up when you can't. This is classic Apple design. It's always either where they are or where they're going. Controlling both hardware and software makes it easier.

Of course they're not the only ones to have ever faced this. First-gen EPYC (Zen) chips had a very NUMA design, which was addressed to various extents in Linux and Windows, with varying and incomplete success. Dual- (and quad-, and more) chip Intel servers, roughly the same story.

That's kind of the point here. All the problems Apple faces have been solved before, multiple times. The question is, how much smarter will they be with their solutions?
 