so do Apple Silicon chips still use some additional more responsive cache memory, or is the system RAM's latency actually low enough this time around so it's not the bottleneck for the CPU/GPU anymore?

Apple Silicon RAM latency is actually rather high, much higher than the usual desktop RAM. This is the result of more complex RAM protocols and interfaces (LPDDR generally has higher latency than DDR). So yes, Apple uses a complex cache hierarchy like everyone else; it's just that Apple uses larger caches in their designs (which again helps with energy efficiency, as expensive RAM requests can be batched/delayed). One interesting thing to mention is that Apple doesn't use a traditional shared L3 in their CPUs, but instead a large L2 shared by a CPU cluster (4 cores), so their L2 kind of combines the roles of L2 and L3. There is also a shared system-on-a-chip cache, but its main purpose appears to be handling memory coherency and data sharing between various coprocessors.

And from what I understand, Apple's cache latency is not stellar; many x86 CPUs have better latencies (because their caches are smaller, is what I gather, but I have a very poor understanding of how caches work). Apple can compensate for some of this by having larger load/store queues and batching requests.
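If you want to see that cache-vs-RAM gap for yourself, here's a rough Swift sketch of the classic pointer-chasing trick (my own toy code, not a rigorous benchmark, and the buffer sizes are just guesses at where the cache levels sit): chase a shuffled chain of indices so every load depends on the previous one, and watch the time per load jump once the working set stops fitting in the caches.

```swift
import Foundation

// Toy pointer-chasing latency probe (illustrative only, not a rigorous benchmark).
// Each element stores the index of the next element in a random cycle, so every
// load depends on the previous one and can't easily be hidden by prefetching.
func chaseLatency(bytes: Int, steps: Int = 5_000_000) -> Double {
    let count = bytes / MemoryLayout<Int>.size
    var order = Array(0..<count)
    order.shuffle()
    var next = [Int](repeating: 0, count: count)
    for i in 0..<count {
        next[order[i]] = order[(i + 1) % count]   // build one big cycle through the buffer
    }
    var idx = order[0]
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<steps {
        idx = next[idx]
    }
    let end = DispatchTime.now().uptimeNanoseconds
    _ = idx   // keep the chain live so the loop isn't trivially removed
    return Double(end - start) / Double(steps)    // ~nanoseconds per dependent load
}

// Sizes chosen to land inside the caches first, then well past the SLC into DRAM.
for mb in [1, 8, 32, 128, 512] {
    print("\(mb) MB working set: \(chaseLatency(bytes: mb * 1_048_576)) ns/load")
}
```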
 
Does this mean that it is useless for AMD and Intel to manufacture CPUs with higher bandwidth unless they include a console-like iGPU?
No, it just means that Apple engineered the M1 Max in a way that the bandwidth to RAM is greater than the bandwidth to the CPU cores alone.
 
A gaming CPU probably doesn't need that kind of bandwidth, because it doesn't deal with that class of problems.
Don't Playstation 5/Xbox X APUs have higher bandwidth than AMD's laptop APUs?

No, it just means that Apple engineered the M1 Max in a way that the bandwidth to RAM is greater than the bandwidth to the CPU cores alone.
Doesn't Apple need to increase bandwidth to keep up with the GPU cores? Would an Mx Extreme (2x Mx Ultra) with the same bandwidth as the M1 Ultra make sense?
 
I don't know about very early architectures, but even my first IBM-compatible PC (80286 @ 8 MHz) had to insert wait cycles for the RAM.
No idea about my computers before that, but I guess it was the same there too.
I can't remember whether it had cache memory, but my next one (80386 @ 40 MHz) most certainly did, to somewhat compensate for the far-too-slow RAM. And of course the RAM itself had plenty of wait states again, set up in the BIOS. When I set those too low, the system became quicker, but also started to get very glitchy.

Of course that was about latency, and not necessarily about bandwidth.

so do Apple Silicon chips still use some additional more responsive cache memory, or is the system RAM's latency actually low enough this time around so it's not the bottleneck for the CPU/GPU anymore?
It was about both latency and bandwidth. Or rather, it was about latency, but it affected bandwidth.

(Random recollection: Do you remember that caches used to be separate chips on motherboards, back when chipsets were actual sets of chips?)

RAM latency will never NOT be a bottleneck, at least unless/until we develop some fundamentally new tech. I don't see that happening any time soon.

RAM latency is enormous compared to the caches, and even more so compared to registers. It's no different for AS than for Intel or AMD chips. Well, it's a little different - I think it's higher for AS, at least so far, because all AS chips to date are using LPDDR. However there are potentially compensating advantages (twice the channels compared to same-size DDR).

I don't think Apple's caches are "more responsive" but I'm too lazy to go look up latencies for all the different architectures. However, they're definitely larger than their contemporaries', and that matters a lot. Larger caches -> higher hit rates -> fewer cache misses -> fewer CPU stalls. Generally speaking, larger caches are slower, so there's a tradeoff here. Apple still seems to have done a better job with caches overall.

Apple's caching is also, arguably, a lot smarter by design. In particular the SLC plays a huge role in coherency, and being tightly linked to all the SoC elements, not just the cores, gives AS a huge advantage over other architectures in certain common situations (even against AMD APUs, which lack a similar level of integration). All of this is covered in significant detail in name99's incredibly well-done look at Apple chip patents. See https://raw.githubusercontent.com/name99-org/AArch64-Explore/main/vol3 SoC.nb.pdf in particular, and search for "SLC" if you don't have the time to go through the whole thing, though I recommend it if you can handle it - it's fascinating.

The L2 handling all the cores in a cluster brings with it certain advantages as well. This is also discussed a lot in name99's analysis.
 
Just a remark about that. A15/M2 introduced SIMD matmul operations which are identical in scope and function to NVIDIA's GPU tensor extensions. Right now these run on the regular SIMD cores on Apple, but who knows what they plan to do in the future.
`simdgroup_matrix` runs 204 FLOPs/core-cycle. Meanwhile NVIDIA Tensor cores run 2040 FLOPs/core-cycle. No matter how you normalize for clock speeds, NVIDIA's 5nm GPU is more power-efficient at AI than Apple's 5nm GPU. Dedicated matrix hardware also costs die space, which they'd rather stick in the AMX/ANE.
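Just to make the clock-normalization arithmetic explicit (the 204 and 2040 per-core-cycle figures are the ones quoted above; the clock speeds in the sketch are placeholder assumptions, not specs):

```swift
// The 204 and 2040 FLOPs/core-cycle figures come from the post above; the clock
// speeds below are placeholder assumptions purely to illustrate the normalization.
let appleMatmulFlopsPerCoreCycle = 204.0
let nvidiaMatmulFlopsPerCoreCycle = 2040.0

let assumedAppleGpuClockGHz = 1.3    // assumption for illustration only
let assumedNvidiaGpuClockGHz = 2.5   // assumption for illustration only

let applePerCoreGflops = appleMatmulFlopsPerCoreCycle * assumedAppleGpuClockGHz     // ~265 GFLOPs per core
let nvidiaPerCoreGflops = nvidiaMatmulFlopsPerCoreCycle * assumedNvidiaGpuClockGHz  // ~5100 GFLOPs per core

let perCycleRatio = nvidiaMatmulFlopsPerCoreCycle / appleMatmulFlopsPerCoreCycle    // 10x
print("Per-core matrix throughput (illustrative): Apple ~\(applePerCoreGflops), NVIDIA ~\(nvidiaPerCoreGflops) GFLOPs")
print("Per-core-cycle gap: \(perCycleRatio)x -- no realistic clock difference closes that on its own.")
```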
 
Apple Silicon RAM latency is actually rather high — much higher than the usual desktop RAM. This is the result of more complex RAM protocols and interfaces (LPDDR generally has higher latency than DDR). So yes, Apple uses a complex cache hierarchy like everyone else, it's just that Apple uses larger caches in their designs (which again helps with energy efficiency as expensive RAM requests can be batched/delayed). One interesting thing to mention is that Apple doesn't use a traditional shader L3 in their CPUs, but instead a large L2 shader between a CPU cluster (4 cores), so their L2 kind of combines the roles of L2 and L3. There is also a shared system-on-a-chip cache, but its main purpose appears to be handling memory coherency and data sharing between various coprocessors.

And from what I understand Apple's cache latency is not stellar, many x86 CPUs have better latencies (due to caches being smaller is what I gather, but I have very poor understanding of how caches work). Apple can compensate some of this by having larger load/store queues and batching the requests.
I'm unfamiliar with your use of "shader" in this context. If I just translate it as "cache" then your statement makes sense. However, don't underestimate the value of a large cache. If (making up some vaguely plausible numbers here) a typical x86 gets a 92% hit rate against cache in a typical workload (yes I know I'm making the word "typical" carry a heavy load here, but the point is valid), while AS gets a 95% hit rate, that can yield a substantial performance improvement.
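To put rough numbers on that, here's a tiny average-memory-access-time sketch using those made-up hit rates (the latencies are also made-up round numbers, just to show why a few points of hit rate matter):

```swift
// Crude average-memory-access-time sketch: AMAT = hitTime + missRate * missPenalty.
// The latencies are made-up round numbers; only the hit rates come from the post above.
func amat(hitRate: Double, cacheNs: Double, dramNs: Double) -> Double {
    cacheNs + (1.0 - hitRate) * dramNs
}

let cacheNs = 10.0    // assumed cache hit latency
let dramNs = 100.0    // assumed DRAM miss penalty

let at92 = amat(hitRate: 0.92, cacheNs: cacheNs, dramNs: dramNs)   // 10 + 0.08 * 100 = 18 ns
let at95 = amat(hitRate: 0.95, cacheNs: cacheNs, dramNs: dramNs)   // 10 + 0.05 * 100 = 15 ns

print("92% hit rate: \(at92) ns per access on average")
print("95% hit rate: \(at95) ns per access on average")
// Going from 92% to 95% cuts misses from 8 per 100 accesses to 5 -- a 37.5% drop in
// DRAM traffic -- even though the hit rate "only" improved by three points.
```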

Also, the SLC (roughly, L3 cache) has a substantial impact on performance. I don't know if I'd agree with your characterization of its "main purpose" because while you're right that it's extremely important for that reason, the result is inextricably intertwined with meaningful performance gains.

Overall I think better cache hit rates, and a generally smarter cache implementation, are much more important than "batching requests". But every part helps.

It's really hard to pull out specific features and say "this is a win" or "this is a loss". *All* the features are interlinked. If you see something that looks bad to you (say, high cache latency), look closer and you'll usually find there's a good reason for it - it's a well-considered tradeoff (say, accepting higher cache latency because you're carrying around a huge set of tags from all the other L2 caches, which lets you avoid a ton of cross-cache coherency chitchat across the NoC... which improves performance and reduces energy usage).
 
`simdgroup_matrix` runs 204 FLOPs/core-cycle. Meanwhile NVIDIA Tensor cores run 2040 FLOPs/core-cycle. No matter how you normalize for clock speeds, NVIDIA's 5nm GPU is more power-efficient at AI than Apple's 5nm GPU. Dedicated matrix hardware also costs die space, which they'd rather stick in the AMX/ANE.

Sure, but who is to say that a future version of Apple Silicon won't introduce some major reshuffling here?

Don't Playstation 5/Xbox X APUs have higher bandwidth than AMD's laptop APUs?

Sure, because the bandwidth also needs to serve the GPU.

Doesn't Apple need to increase bandwidth to keep up with the GPU cores? Would an Mx Extreme (2x Mx Ultra) with the same bandwidth as the M1 Ultra make sense?

Depends on what tradeoffs you are willing to live with. For example the RTX 4090 has very low RAM bandwidth per unit of compute (lower than M1 even). But it seems to work just fine for them.
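For anyone curious what "bandwidth per unit of compute" looks like as a number, here's a rough sketch; the bandwidth and TFLOPs figures are approximate public specs quoted from memory, so treat them as ballpark only.

```swift
import Foundation

// Rough "bytes of RAM bandwidth per FLOP of compute" comparison.
// Spec numbers are approximate public figures quoted from memory -- ballpark only.
struct Gpu {
    let name: String
    let bandwidthGBs: Double   // peak memory bandwidth, GB/s
    let fp32TFLOPs: Double     // peak FP32 throughput, TFLOPs
    var bytesPerFlop: Double { (bandwidthGBs * 1e9) / (fp32TFLOPs * 1e12) }
}

let gpus = [
    Gpu(name: "RTX 4090 (approx.)", bandwidthGBs: 1008, fp32TFLOPs: 82),
    Gpu(name: "M1 (approx.)",       bandwidthGBs: 68,   fp32TFLOPs: 2.6),
]

for gpu in gpus {
    print("\(gpu.name): \(String(format: "%.4f", gpu.bytesPerFlop)) bytes of bandwidth per FLOP")
}
// The RTX 4090 ends up around 0.012 bytes/FLOP vs roughly 0.026 for the M1,
// which is the "low bandwidth per unit of compute" being described above.
```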
 
Don't Playstation 5/Xbox X APUs have higher bandwidth than AMD's laptop APUs?
Yes, they're using GDDR instead of (LP)DDR, which has much higher bandwidth. Both Sony and MS decided to spend more money on RAM because without GDDR the graphics simply wouldn't have been good enough, and they didn't want separate graphics and system RAM. They are, in fact, running "Unified" RAM of sorts, though not on the same level as AS.
Doesn't Apple need to increase bandwidth to keep up with the GPU cores? Would an Mx Extreme (2x Mx Ultra) with the same bandwidth as the M1 Ultra make sense?
Probably not, which is why such a thing would be quite unlikely. You'd more likely see even wider RAM, or possibly HBM (though that seems somewhat unlikely).
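On the GDDR point above: peak bandwidth is just bus width times per-pin transfer rate, so the gap falls out of simple arithmetic. A sketch with approximate, from-memory configurations (don't hold me to the exact figures):

```swift
// Peak theoretical bandwidth is bus width times per-pin transfer rate:
//   GB/s = (bus width in bits / 8) * transfer rate in GT/s
// The configurations below are approximate, from-memory examples, not exact specs.
func peakBandwidthGBs(busWidthBits: Double, transferRateGTs: Double) -> Double {
    (busWidthBits / 8.0) * transferRateGTs
}

print("PS5-style 256-bit GDDR6 @ 14 GT/s:    \(peakBandwidthGBs(busWidthBits: 256, transferRateGTs: 14)) GB/s")   // ~448
print("M1 Pro-style 256-bit LPDDR5 @ 6.4:    \(peakBandwidthGBs(busWidthBits: 256, transferRateGTs: 6.4)) GB/s")  // ~205
print("Laptop APU 128-bit DDR4 @ 3.2 GT/s:   \(peakBandwidthGBs(busWidthBits: 128, transferRateGTs: 3.2)) GB/s")  // ~51
```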
 
Doesn't Apple need to increase bandwidth to keep up with the GPU cores?
That's what they did. The M1 Pro has 200GB/sec which are all shared between CPU and GPU. The M1 Max doubles that to 400GB/sec to give the larger GPU more bandwidth. These 400GB/sec are still shared between CPU and GPU, just that the CPU part is limited to the same 200GB/sec as in the M1 Pro, leaving 200GB/sec exclusively for the GPU. (These are maximum numbers that won't always be achieved.)
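Put as a toy model, using the peak numbers above (real sustained bandwidth is lower, and the exact CPU cap is my reading of the measurements, so treat this as a sketch):

```swift
// Toy model of how the shared bandwidth budget splits between CPU and GPU on
// M1 Pro vs M1 Max, using the peak figures quoted above (sustained numbers are lower).
struct BandwidthBudget {
    let totalGBs: Double     // total package bandwidth
    let cpuCapGBs: Double    // most the CPU complex can pull on its own
    var gpuGuaranteedGBs: Double { totalGBs - cpuCapGBs }   // left over even if the CPU maxes out
}

let m1Pro = BandwidthBudget(totalGBs: 200, cpuCapGBs: 200)  // CPU alone can saturate the package
let m1Max = BandwidthBudget(totalGBs: 400, cpuCapGBs: 200)  // doubling RAM width mainly helps the GPU side

print("M1 Pro: GPU guaranteed \(m1Pro.gpuGuaranteedGBs) GB/s when the CPU is maxed out")
print("M1 Max: GPU guaranteed \(m1Max.gpuGuaranteedGBs) GB/s when the CPU is maxed out")
```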
 
Don't Playstation 5/Xbox X APUs have higher bandwidth than AMD's laptop APUs?
Well, yes, because they have a lot more GPU cores, making them more like a GPU with some CPU cores than the opposite. That's why the memory subsystem is more similar to that of a GPU than to that of a CPU.
 
Apple is taking over dedicated hardware again... an M2 MacBook Pro 13 or a MacBook Air with that same chip can easily outperform any high-end x86 processor when it comes to image/video processing, oddly enough, even where Apple didn't add special features for it. Even for audio work, right now the M-series chips are far better than anything you can get in the x86 market for professional work.

I don't know about you, but I have both a Mac and a PC: a maxed-out MacBook Pro M1 Max and a high-end ASUS Zephyrus G15. One for work, one for entertainment. I have done some work on the PC, just to put that doubt to rest. My testing has all been real scenarios, using DaVinci Resolve on both computers, and sometimes FCP vs. Premiere (sadly there's no FCP for PC). Editing times in terms of render, timeline flow, and smoothness are very similar, with the Mac pulling ahead precisely in render times and high-resolution file playback in ways my PC can only dream of. In fact, to get a similar experience I would have to spend an extra USD $1k upgrading my PC just to be on par with the Mac, and even then, the dedicated hardware and encoders in the M series will still be faster at those specific tasks than the most NVIDIA of the NVIDIAs and the most x86 of the x86s. There's a reason it's called DEDICATED HARDWARE.

I use my PC as my everyday driver because I use my Mac for work; right now I'm typing this on my ASUS keyboard. To be honest, if I didn't do what I do for a living, I would stick with a PC, because I wouldn't need what the M series actually offers. In the end, computers are for work or for games; the day-to-day needs like web browsing, light word processing, and social media can be handled by an iPad or any Android device. My point is, Apple right now is like Ferrari: they're making excellent high-performance machines that only a small number of people will buy for the show, and an even smaller number will use for what they're meant for.
 
That's probably the only reason, and it's a valid one for most use cases.
It just doesn't bode well for "ultra pro" workstations, which are a niche within a niche anyway in terms of market share.
Are you sure the Mac Pro is a niche in professional video production?
 
Between the issues with the Final Cut Pro to Final Cut Pro X transition, the Trashcan, and Apple generally neglecting the Mac Pro product line, it's safe to say that many have jumped ship. Probably hard to get reliable numbers.
You are forgetting that workstations don't have the same refresh cycle as consumer computers. Workstations are kept way longer than your normal computers.
 
The point for a friend of mine, who is off-grid on solar power, is that the Air uses only 5 to 7 watts during heavy video watching and music playing while surfing! He also likes the speed at such low power; he loves having a low-powered Air in his setup alongside his Tesla satellite internet setup!
 
You are forgetting that workstations don't have the same refresh cycle as consumer computers. Workstations are kept way longer than your normal computers.
Apple didn't refresh the Mac Pro from 2013 to 2019. The 2019 one is getting lapped by Apple's own laptops. There's no way many professionals are sticking with it unless they absolutely have to.
 
Frankly, I don't understand what "pros" need that the current top-spec Mac Studio doesn't offer.

We've basically peaked in terms of M1 processing power. Sure, the M2/M3 might top that, but if you're doing insanely large dataset processing (the only thing someone could need MORE processing for), you're already using a GPU cluster. So you can connect a MacBook Air to write the code and run it on that GPU cluster, which (certainly) runs Linux.
 
Frankly, I don't understand what "pros" need that the current top-spec Mac Studio doesn't offer.

We've basically peaked in terms of M1 processing power. Sure, the M2/M3 might top that, but if you're doing insanely large dataset processing (the only thing someone could need MORE processing for), you're already using a GPU cluster. So you can connect a MacBook Air to write the code and run it on that GPU cluster, which (certainly) runs Linux.
It's not just raw computing power that pros need. It's also expansion. Many musicians and recording studios need PCIe expansion cards for multiple low-latency audio inputs. Others need Fibre Channel networking or additional I/O expansion. Others might need a card with legacy I/O for certain devices. Others might use PCIe slots for storage beyond Apple's internally provided storage, and a great many would like to add additional GPU cards to their computer as well.

Now many such cards might work fine in a PCIe enclosure running over Thunderbolt, but for many workflows that will add latency to an unacceptable degree. Having a computer with internal expansion allows it to become whatever the user needs it to be, especially if such a machine is kept and repurposed for another use after its primary use has been superseded by something newer.

Honestly, I hope the next Mac Pro keeps the same number of PCIe slots. When the 2019 Mac Pro came out, it felt a bit nostalgic, as Apple hadn't made a machine with that many expansion slots since the good old Apple ][.
 
The extra bandwidth of the Max is just for the GPU. From the CPU capabilities standpoint there is no difference between the Pro and the Max. While Max has a larger SoC cache, you will be hard pressed to find a problem where this will be noticeable.
You are right. They do score about the same. However, variability in benchmarks can sometimes cause the M1 Pro to outperform the M1 Max.

The M1 Pro has 200GB/sec which are all shared between CPU and GPU. The M1 Max doubles that to 400GB/sec to give the larger GPU more bandwidth. These 400GB/sec are still shared between CPU and GPU, just that the CPU part is limited to the same 200GB/sec as in the M1 Pro, leaving 200GB/sec exclusively for the GPU. (These are maximum numbers that won't always be achieved.)
Does the M1 Ultra do the same?
 
Does the M1 Ultra do the same?
Same idea, different numbers. The Ultra multiplies everything by two again. So 800GBps instead of 400GBps, with half only accessible to the GPU (and possibly NPU etc., I don't recall).

The Ultra is basically two Maxes glued together, and it's generally safe to think of it that way. (The "gluing" is an industry-first engineering achievement of considerable note but not really germane to this.)
 
In which sense do you mean "not on the same level"?
IIRC the Xbox and PS5 have much less integration between CPU and GPU cores. So while they share the same unified RAM, they're less efficient about moving stuff between the two. That's because, AFAIK, they have nothing like the AS SLC.
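To make the integration point concrete from the developer side: on Apple Silicon a Metal buffer created with shared storage is one allocation that both the CPU and GPU see, so nothing has to be copied across a bus. A minimal Swift sketch (kernel setup and error handling omitted):

```swift
import Metal

// Minimal sketch of unified memory in practice on Apple Silicon: a .storageModeShared
// buffer is one allocation visible to both CPU and GPU, so no copy is needed between them.
guard let device = MTLCreateSystemDefaultDevice(),
      let buffer = device.makeBuffer(length: 1024 * MemoryLayout<Float>.stride,
                                     options: .storageModeShared) else {
    fatalError("Metal device or buffer unavailable")
}

// The CPU writes directly into the buffer's contents...
let ptr = buffer.contents().bindMemory(to: Float.self, capacity: 1024)
for i in 0..<1024 { ptr[i] = Float(i) }

// ...and a GPU kernel could read the very same memory by binding this buffer to a
// compute command encoder (kernel setup omitted), with coherency handled by the hardware.
```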
 