ChipsAndCheese did a comparison on this.
"Like CPUs, modern GPUs have evolved to use complex, multi-level cache hierarchies." (chipsandcheese.com)

(chipsandcheese.com: part 2 of their Zen 4 coverage)
I used these kinds of sources to reverse-engineer the AGX's memory system. Apple, AMD, and NVIDIA GPU cores can all read 64 B per cycle from L1D with 128 B cache lines. Here are some stats for the benchmarked M1 models and an AMD APU.
| BW/Core-Cycle | AMD Zen 4 CPU | AMD 4800H GPU | M1 Max/M2 CPU | M1 Base GPU |
| --- | --- | --- | --- | --- |
| L1D | 64 B (2 AVX-256) | 64 B | 48 B (3 NEON-128) | 64 B |
| L2D | 32 B (1 AVX-256) | 32 B | 32 B (2 NEON-128) | 32 B |
| L3D | ~27.1 B | ~27.0 B | 32 B (2 NEON-128) | ~106.4 B |
| RAM | ~10.0 B | ~27.0 B | 32 B (2 NEON-128) | ~48.3 B |
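The parenthetical pipe counts map directly onto the byte figures. A quick sanity check of that notation (my arithmetic, not from the linked articles):

```python
# Convert "N load pipes x vector width" into bytes per core-cycle.
def bytes_per_cycle(pipes: int, vector_bits: int) -> float:
    return pipes * vector_bits / 8

# Zen 4 L1D: 2 AVX-256 loads per cycle
assert bytes_per_cycle(2, 256) == 64
# M1 Max / M2 CPU L1D: 3 NEON-128 loads per cycle
assert bytes_per_cycle(3, 128) == 48
# L2 on both: one 256-bit load, or equivalently two 128-bit loads, per cycle
assert bytes_per_cycle(1, 256) == bytes_per_cycle(2, 128) == 32
```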
For the GPUs, the L3D and RAM bandwidths are theoretical aggregates; the per-core limit is probably the 32 B/core-cycle that's universal across the chart. Since those levels are bottlenecked by clock speed, one GPU core cannot actually read that many bytes per cycle.
This explains why the M1 Pro is limited in single-core bandwidth: for a single CPU core, RAM is as fast as the L2 cache! There is no artificial limitation saying "one CPU core can't touch this bandwidth"; it's the same situation as the GPU. It also means the Zen 4 CPU will be memory-bottlenecked in workloads where the M1 CPU is not: it needs ~3x the arithmetic intensity to stay compute-bound.
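Here's where the ~3x figure comes from, roofline-style, using the RAM bandwidths from the table. The peak FP32 rates are my assumptions (2 AVX-256 FMA pipes on Zen 4, 4 NEON-128 FMA pipes on the M1, an FMA counting as 2 FLOPs per lane), so treat this as a sketch:

```python
# Break-even arithmetic intensity (FLOPs/byte) at which one core stops
# being RAM-bound, from the per-core-cycle bandwidths in the table above.
# Assumed peak FP32 rates: 2 AVX-256 FMA pipes (Zen 4), 4 NEON-128 FMA
# pipes (M1); one FMA = 2 FLOPs per 32-bit lane.
zen4_flops_per_cycle = 2 * (256 // 32) * 2   # 32 FLOPs/cycle
m1_flops_per_cycle   = 4 * (128 // 32) * 2   # 32 FLOPs/cycle

zen4_ram_bw = 10.0   # B/core-cycle, from the table
m1_ram_bw   = 32.0   # B/core-cycle, from the table

zen4_breakeven = zen4_flops_per_cycle / zen4_ram_bw  # 3.2 FLOPs/B
m1_breakeven   = m1_flops_per_cycle / m1_ram_bw      # 1.0 FLOPs/B
print(zen4_breakeven / m1_breakeven)                 # ~3.2x more intensity needed
```

Both cores have the same assumed peak FLOPs/cycle, so the gap is purely the 10 B vs 32 B per-core-cycle RAM bandwidth.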
However, L1-resident workloads prevent all 4 SIMD ALUs of the M1 from being fully utilized (only 1 fetched operand per instruction); on AMD, all of them can be. On any GPU, SIMD capability vastly exceeds L1D or threadgroup bandwidth (often the same subsystem). Even register bandwidth isn't enough, requiring a "register cache" sized closer to a CPU's actual register file.
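To see how badly SIMD capability outruns L1D, count the operand traffic of a single SIMD FMA per cycle. The numbers here are illustrative assumptions (32-wide SIMD, FP32, 3 source operands per FMA), not measurements:

```python
# Operand traffic for one 32-wide FP32 FMA issued per cycle.
# Assumptions: 32 lanes, 3 source operands per FMA, 4 B per lane.
simd_width, operands, bytes_per_lane = 32, 3, 4
operand_bytes_per_cycle = simd_width * operands * bytes_per_lane  # 384 B/cycle

l1d_bytes_per_cycle = 64  # GPU L1D bandwidth, from the table
print(operand_bytes_per_cycle / l1d_bytes_per_cycle)  # 6.0x shortfall
```

Even if every operand could come straight from L1D, the cache would need 6x its actual bandwidth to feed one FMA per cycle; hence operands must live in registers, and the register file itself needs a small cache in front of it.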
Regarding the L1I, the M1 GPU could literally read all of its instructions from main memory and not break a sweat* (8 B x 4 IPC = 32 B/cycle). Throughput only drops from 98% of peak GFLOPS to 78% (FFMA32) or ~73% (FCMPSEL32).
If ARM64 instructions were 8 B, the CPU could probably do this too; better yet, ARM64 instructions are only 4 B. However, CPUs can't hide instruction-fetch latency the way GPUs can.
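The fetch-bandwidth arithmetic works out as below. The GPU figures follow from the post; the CPU decode width (8-wide) is my assumption:

```python
# Instruction-fetch traffic vs available bandwidth, per core-cycle.

# M1 GPU: 8 B instructions at 4 IPC.
gpu_fetch_bytes = 8 * 4        # 32 B/cycle
gpu_ram_bw = 48.3              # M1 Base GPU RAM B/core-cycle, from the table
assert gpu_fetch_bytes < gpu_ram_bw   # fetch fits within RAM bandwidth alone

# M1 CPU: 4 B ARM64 instructions, 8-wide decode (assumed width).
cpu_fetch_bytes = 4 * 8        # 32 B/cycle
cpu_l2_bw = 32                 # B/core-cycle, from the table
assert cpu_fetch_bytes <= cpu_l2_bw   # even L2 could keep the decoder fed
```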
* I vaguely recall ~1.5-2x deltas for seriously bloated transcendental functions. This probably depends on the instruction and its register dependency characteristics.