ChipsAndCheese did a comparison on this.
"Like CPUs, modern GPUs have evolved to use complex, multi-level cache hierarchies." (chipsandcheese.com)

(chipsandcheese.com: part 2 of their Zen 4 coverage)
I used these kinds of sources to reverse-engineer the AGX's memory system. Apple, AMD, and NVIDIA GPU cores can all read 64 B per cycle from L1D with 128 B cache lines. Here are some stats for the benchmarked M1 models and an AMD APU.
| BW/Core-Cycle | AMD Zen 4 CPU | AMD 4800H GPU | M1 Max/M2 CPU | M1 Base GPU |
| --- | --- | --- | --- | --- |
| L1D | 64 B (2 AVX-256) | 64 B | 48 B (3 NEON-128) | 64 B |
| L2D | 32 B (1 AVX-256) | 32 B | 32 B (2 NEON-128) | 32 B |
| L3D | ~27.1 B | ~27.0 B | 32 B (2 NEON-128) | ~106.4 B |
| RAM | ~10.0 B | ~27.0 B | 32 B (2 NEON-128) | ~48.3 B |
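The parenthetical pipe counts map directly onto the byte figures. A quick sanity check of that notation (my arithmetic, not from the linked articles):

```python
# Convert "N load pipes x vector width" into bytes per core-cycle.
def bytes_per_cycle(pipes: int, vector_bits: int) -> float:
    return pipes * vector_bits / 8

# Zen 4 L1D: 2 AVX-256 loads per cycle
assert bytes_per_cycle(2, 256) == 64
# M1 Max / M2 CPU L1D: 3 NEON-128 loads per cycle
assert bytes_per_cycle(3, 128) == 48
# L2 on both: one 256-bit load, or equivalently two 128-bit loads, per cycle
assert bytes_per_cycle(1, 256) == bytes_per_cycle(2, 128) == 32
```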
For the GPUs, the L3D and RAM bandwidths are theoretical aggregates; the per-core limit is probably the 32 B/core-cycle that's universal across the chart. Since those levels are bottlenecked by clock speed, one GPU core cannot actually read that many bytes per cycle.
This explains why the M1 Pro is limited in single-core bandwidth: for a single CPU core, RAM is as fast as the L2 cache! There is no artificial limitation saying "one CPU core can't touch this bandwidth"; it's the same situation as the GPU. It also means the Zen 4 CPU will be memory-bottlenecked in workloads where the M1 CPU is not: it needs ~3x the arithmetic intensity to stay compute-bound.
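Here's where the ~3x figure comes from, roofline-style, using the RAM bandwidths from the table. The peak FP32 rates are my assumptions (2 AVX-256 FMA pipes on Zen 4, 4 NEON-128 FMA pipes on the M1, an FMA counting as 2 FLOPs per lane), so treat this as a sketch:

```python
# Break-even arithmetic intensity (FLOPs/byte) at which one core stops
# being RAM-bound, from the per-core-cycle bandwidths in the table above.
# Assumed peak FP32 rates: 2 AVX-256 FMA pipes (Zen 4), 4 NEON-128 FMA
# pipes (M1); one FMA = 2 FLOPs per 32-bit lane.
zen4_flops_per_cycle = 2 * (256 // 32) * 2   # 32 FLOPs/cycle
m1_flops_per_cycle   = 4 * (128 // 32) * 2   # 32 FLOPs/cycle

zen4_ram_bw = 10.0   # B/core-cycle, from the table
m1_ram_bw   = 32.0   # B/core-cycle, from the table

zen4_breakeven = zen4_flops_per_cycle / zen4_ram_bw  # 3.2 FLOPs/B
m1_breakeven   = m1_flops_per_cycle / m1_ram_bw      # 1.0 FLOPs/B
print(zen4_breakeven / m1_breakeven)                 # ~3.2x more intensity needed
```

Both cores have the same assumed peak FLOPs/cycle, so the gap is purely the 10 B vs 32 B per-core-cycle RAM bandwidth.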
However, L1-resident workloads prevent all 4 SIMD ALUs of the M1 from being fully utilized (only 1 fetched operand per instruction); on AMD, all of them can be. On any GPU, SIMD capability vastly exceeds L1D or threadgroup bandwidth (often the same subsystem). Even register bandwidth isn't enough, requiring a "register cache" sized closer to a CPU's actual register file.
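To see how badly SIMD capability outruns L1D, count the operand traffic of a single SIMD FMA per cycle. The numbers here are illustrative assumptions (32-wide SIMD, FP32, 3 source operands per FMA), not measurements:

```python
# Operand traffic for one 32-wide FP32 FMA issued per cycle.
# Assumptions: 32 lanes, 3 source operands per FMA, 4 B per lane.
simd_width, operands, bytes_per_lane = 32, 3, 4
operand_bytes_per_cycle = simd_width * operands * bytes_per_lane  # 384 B/cycle

l1d_bytes_per_cycle = 64  # GPU L1D bandwidth, from the table
print(operand_bytes_per_cycle / l1d_bytes_per_cycle)  # 6.0x shortfall
```

Even if every operand could come straight from L1D, the cache would need 6x its actual bandwidth to feed one FMA per cycle; hence operands must live in registers, and the register file itself needs a small cache in front of it.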
Regarding the L1I, the M1 GPU could literally read all of its instructions from main memory and not break a sweat* (8 B x 4 IPC = 32 B/cycle). Throughput only drops from 98% of peak GFLOPS to 78% (FFMA32) or ~73% (FCMPSEL32).
If ARM64 instructions were 8 B, the CPU could probably do this too; better yet, ARM64 instructions are only 4 B. However, CPUs can't hide instruction-fetch latency the way GPUs can.
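The fetch-bandwidth arithmetic works out as below. The GPU figures follow from the post; the CPU decode width (8-wide) is my assumption:

```python
# Instruction-fetch traffic vs available bandwidth, per core-cycle.

# M1 GPU: 8 B instructions at 4 IPC.
gpu_fetch_bytes = 8 * 4        # 32 B/cycle
gpu_ram_bw = 48.3              # M1 Base GPU RAM B/core-cycle, from the table
assert gpu_fetch_bytes < gpu_ram_bw   # fetch fits within RAM bandwidth alone

# M1 CPU: 4 B ARM64 instructions, 8-wide decode (assumed width).
cpu_fetch_bytes = 4 * 8        # 32 B/cycle
cpu_l2_bw = 32                 # B/core-cycle, from the table
assert cpu_fetch_bytes <= cpu_l2_bw   # even L2 could keep the decoder fed
```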
* I vaguely recall ~1.5-2x deltas for seriously bloated transcendental functions. This probably depends on the instruction and its register dependency characteristics.