i don't know about very early architecture, but even my first IBM compatible PC (80286 @8MHz) had to do wait cycles for the RAM.
no idea about my computers before that, but i guess it was the same there too.
can't remember if it had cache memory, but my next one (80386 @40MHz) most certainly did, to somewhat compensate for the way too slow RAM. and of course the RAM itself again had tons of wait states set up in the BIOS. when i set those too low, the system became quicker, but also started to become very glitchy.
of course that was about latency, and not necessarily about bandwidth.
so do Apple Silicon chips still use some additional more responsive cache memory, or is the system RAM's latency actually low enough this time around so it's not the bottleneck for the CPU/GPU anymore?
It was about both latency and bandwidth. Or rather, it was about latency, but it affected bandwidth.
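To put rough numbers on how wait states (latency) eat into bandwidth, here's a minimal sketch. The function name and all the figures are my own illustrative assumptions (an 8 MHz clock, a 16-bit bus, 2 clocks per access before wait states), not measurements from any particular machine.

```python
def effective_bandwidth(clock_hz, bus_bytes, base_cycles, wait_states):
    """Bytes per second when every bus access costs base_cycles + wait_states clocks."""
    cycles_per_access = base_cycles + wait_states
    return clock_hz * bus_bytes / cycles_per_access

# Assumed, illustrative values: 8 MHz clock, 16-bit (2-byte) bus,
# 2 clocks per access when no wait states are inserted.
for ws in (0, 1, 2):
    bw = effective_bandwidth(8_000_000, 2, 2, ws)
    print(f"{ws} wait states: {bw / 1e6:.2f} MB/s")
```

Under those assumptions, each added wait state stretches every access, so the same bus moves fewer bytes per second even though its width and clock haven't changed.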
(Random recollection: Do you remember that caches used to be separate chips on motherboards, back when chipsets were actual sets of chips?)
RAM latency will never NOT be a bottleneck, at least unless/until we develop some fundamentally new tech. I don't see that happening any time soon.
RAM latency is enormous compared to the caches, and even more so compared to registers. It's no different for AS than for Intel or AMD chips. Well, it's a little different: I think RAM latency is higher for AS, at least so far, because all AS chips to date use LPDDR. However, there are potentially compensating advantages (twice the channels compared to same-size DDR).
I don't think Apple's caches are "more responsive", but I'm too lazy to go look up latencies for all the different architectures. However, they're definitely larger than their contemporaries'. And that matters a lot. Larger caches -> higher hit rates = fewer cache misses -> fewer CPU stalls. Generally speaking, larger caches are slower, so there's a tradeoff here. Apple still seems to have done a better job with caches overall.
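The size-versus-speed tradeoff can be sketched with the standard textbook average memory access time formula (hit time + miss rate x miss penalty). This is a single-level simplification, and every number below is made up for illustration; none of them are real latencies or hit rates for Apple, Intel, or AMD parts.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles, for a single cache level."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical caches: the larger one is a cycle slower to hit,
# but misses half as often. Miss penalty (trip to RAM) assumed at 100 cycles.
small_cache = amat(hit_time=3, miss_rate=0.10, miss_penalty=100)
large_cache = amat(hit_time=4, miss_rate=0.05, miss_penalty=100)

print(f"small cache: {small_cache:.1f} cycles on average")
print(f"large cache: {large_cache:.1f} cycles on average")
```

With a miss penalty that large, the slower-but-bigger cache comes out ahead in this toy case. Shrink the miss penalty or the hit-rate gain enough and the result flips, which is the tradeoff.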
Apple's caching is also, arguably, a lot smarter by design. In particular the SLC plays a huge role in coherency, and being tightly linked to all the SoC elements, not just the cores, gives AS a huge advantage over other architectures in certain common situations (even against AMD APUs, which lack a similar level of integration). All of this is covered in significant detail in name99's incredibly well-done look at Apple chip patents. See
https://raw.githubusercontent.com/name99-org/AArch64-Explore/main/vol3 SoC.nb.pdf in particular, and search for "SLC" if you don't have the time to go through the whole thing, though I recommend it if you can handle it - it's fascinating.
The L2 handling all the cores in a cluster brings with it certain advantages as well. This is also discussed a lot in name99's analysis.