For laptops, I get it. ARM offers nice battery life, but a Mac Pro has no battery life.
My workstation has a 4090, and it was a 3090 until a few weeks ago. More power generates more heat: those cards throttle when running serious AI workloads and shut down a few times a week. Since Apple released the Core ML conversion tools, my M1 Max MacBook Pro has become a decent substitute; it runs cool and has never throttled or shut down on me. The AMD Threadripper 3960X CPU is a furnace under heavy loads.
Unified memory is a big deal. My 4090 often runs out of memory, but my M1 Max with 64 GB of unified memory works just fine, though a bit slower. I wonder whether Apple wants to compete with the NVIDIA A100 and other high-end GPUs, where the real training and model generation happens. I would love to see a Mac Pro with 256 GB or more of unified memory; that would be a game-changer.
The M1 Max is slightly slower, but it's the only option for memory-intensive work. I don't want to use a more expensive A100 in the cloud for anything other than training.
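For anyone curious, the Core ML conversion step looks roughly like this (a minimal sketch using coremltools on a stand-in torchvision model, not my actual workload):

```python
import torch
import torchvision
import coremltools as ct

# Illustrative example: convert a small torchvision model; in practice this
# would be whatever PyTorch model you are moving off the 4090.
model = torchvision.models.mobilenet_v2(weights=None).eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="input", shape=example.shape)],
    compute_units=ct.ComputeUnit.ALL,   # let Core ML schedule across CPU, GPU and ANE
)
mlmodel.save("MobileNetV2.mlpackage")
```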
 
The more I read about benchmarks, the less sense I find in them. For example, the M1 Pro crushes Ryzen in GCC (SPECint2017) but has similar performance in Clang (Geekbench 5). I wouldn't expect that much disparity in such similar workloads.

(Edited for better visualization)
[Attached: benchmark comparison charts]

Anandtech's analysis explains that the difference between the M1 Pro/Max and AMD/Intel CPUs is greater in memory-bound workloads than in compute-bound workloads.

Is the huge memory bandwidth that Mx SoCs have more important to the success of Apple Silicon than the use of ARM ISA?
Memory bandwidth, and having unified memory for the CPU and GPU, makes a huge difference in real life outside of benchmarks. The 4090 may post higher benchmark numbers, but I am so frustrated right now with it running out of memory and throttling. If Apple can make a Mac Pro with 256 GB or more of memory and double the GPU cores at low power consumption, it would be a game-changer.
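Just to make the unified-memory point concrete, here is a rough, illustrative sketch assuming PyTorch's MPS backend (PyTorch 1.12 or later); the tensor sizes are placeholders and not tuned to any real model:

```python
import torch

# Use Apple's GPU through the MPS backend when it is available.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Tensors allocated on "mps" come out of the same unified pool the CPU uses,
# so the ceiling is system RAM (e.g. 64 GB) rather than a fixed VRAM budget.
a = torch.randn(32_768, 16_384, dtype=torch.float16, device=device)
b = torch.randn(16_384, 8_192, dtype=torch.float16, device=device)
c = a @ b   # runs on the M-series GPU
```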
 
Is the huge memory bandwidth that Mx SoCs have more important to the success of Apple Silicon than the use of ARM ISA?
IMHO, the world is very used to the traditional von Neumann architecture with narrow pipes, because those narrow pipes provide the benefit of modularity at the expense of performance. Also, all of this started with very slow CPUs, so initially it was the CPU that was the bottleneck. The CPU core has advanced much faster than those "pipes" and their surrounding components (e.g. memory, PCIe, etc.).

What Apple is trying to achieve, IMHO, is to progressively remove those bottlenecks by widening the "pipes". As an analogy, you don't need to breathe as fast (CPU clock speed) if you enlarge your windpipe (memory channel), since more air (data) can get into your lungs, compared to when you have a narrower windpipe, irrespective of how large your lungs (CPU) are.

Huge bandwidth is important to any CPU architecture. The ISA does not affect how much bandwidth is required so much as the CPU microarchitecture, which determines the bandwidth it needs, together with the internal cache fabric (L1/L2/L3). There is a reason why Intel and AMD server CPUs have many more memory channels than their top-end consumer-grade CPUs.

Your argument for distrusting Anandtech's SPEC figures because they were run under WSL is probably unwarranted, IMHO. If my understanding is correct, SPEC does not measure OS capability; rather, it tries to measure a CPU's and its surrounding subsystems' capability. Once data is loaded into memory, the OS basically gets out of the way, unless of course it gets in the way when shuffling data around memory.

On the other hand, SPEC was developed to run in a Unix environment, so it relies on how well the system libraries (e.g. glibc) have been designed and optimised. It may turn out that WSL's system libraries (which are likely a shim over Windows' kernel APIs) are sub-optimal, but I don't think it will be that bad, since it is mainly just translating one API call to another. I don't think SPEC's components run with constant calls into system APIs.

Anyway, my 2 cents worth.
 
Does Apple really believe the M2 Extreme would beat a 192-core AMD CPU and an RTX 4090? Heck, you could probably put multiple RTX 4090s in the Mac Pro (if Apple sorts out its politics with NVIDIA).
You can't put your flagship product on the market shipping with the competition's CPU. That would be an acknowledgement that x86-64 is better technology than ARM for serious professionals.
 
Perhaps the better question to ask is what’s the point of anything? As we all know most technology is made obsolete shortly after you buy it and something better comes out shortly thereafter.
sure

but i bet you i can still run well-written software compiled for an i386 on win95, linux or mac in windows 7 (possibly higher), linux or mac. this applies to MIPS as well but since it was never running on MAC, only the linux and nt variants (however long they lasted) would apply. a similar argument may hold for PPC as well but i'm not well-versed on this architecture.

the problem isn't that APPUL chose to move to a new processor.

the problem is that they chose a design that historically has no legacy support and was viewed as "disposable".

the typical arm application was often targeted, cheap and came with a decent c compiler. that's all you needed. the concept of long-term binary support was never on the table.

this is one huge problem that APPUL cannot ignore. PPC did not have this problem. it will come to roost, but by that time the stock hysteria they are desperately trying to prolong will have subsided (wherever that may lead).

everything they're doing now is short term. i am no steve jobs fan (gates has been my hero since 1996), but there is NO WAY he would have chosen to move to ARM just because of the legacy binary incompatibility.
 
My workstation has a 4090, and it was a 3090 until a few weeks ago. More power generates more heat: those cards throttle when running serious AI workloads and shut down a few times a week. Since Apple released the Core ML conversion tools, my M1 Max MacBook Pro has become a decent substitute; it runs cool and has never throttled or shut down on me. The AMD Threadripper 3960X CPU is a furnace under heavy loads.
Unified memory is a big deal. My 4090 often runs out of memory, but my M1 Max with 64 GB of unified memory works just fine, though a bit slower. I wonder whether Apple wants to compete with the NVIDIA A100 and other high-end GPUs, where the real training and model generation happens. I would love to see a Mac Pro with 256 GB or more of unified memory; that would be a game-changer.
The M1 Max is slightly slower, but it's the only option for memory-intensive work. I don't want to use a more expensive A100 in the cloud for anything other than training.
This confirmation bias eases my mind... the M1 GPU is the best architecture!

https://github.com/philipturner/metal-benchmarks

Apple's not going to make the GPU better for training. However, they might spec up the Neural Engine, and the hype around FP8 has spawned algorithms for training in FP16. It's cooler and sometimes faster for running Stable Diffusion too. I did some rough estimates, and you could run SD nonstop on the ANE for several hours and not run out of battery.
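As a rough sketch, steering an already-converted model toward the Neural Engine with coremltools looks like this; the package name and the input array are placeholders, not an actual Stable Diffusion pipeline:

```python
import numpy as np
import coremltools as ct

# Load a previously converted model and ask Core ML to prefer the ANE.
mlmodel = ct.models.MLModel(
    "unet.mlpackage",                         # placeholder: some converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer CPU + Neural Engine
)

# Placeholder input; a real SD U-Net takes latents, timestep embeddings, etc.
sample = np.random.rand(1, 4, 64, 64).astype(np.float32)
outputs = mlmodel.predict({"sample": sample})
```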
 
So far everyone seems to have missed a key point here.

We keep hearing people saying of the AS chips "oh, they're great in mobile, but they can't play at the high end". And so far that's true, but not for the reasons most often stated. So far, Apple's efforts have had varying success at the high end - which would be only the Ultra, a chip with some brilliant engineering but which barely begins to do what they'd need to do to succeed with a really large number of CPU cores (32-96, let's say).

The problem in the Ultra, and in any larger chip they'd design, is scaling efficiency. I (and name99, who's got even more to say on this) have mentioned this in the past.

What everyone is missing is that the wide/slow core design that Apple's been improving incrementally over time is *exactly* what you'd like to see in a really big multicore chip. You want good performance at low power, so that when you put 64 of them next to each other you don't initiate fusion...

Right now, if you look at large AMD and Intel chips, there are two key limitations on multicore performance:
1) uncore power/scaling: the uncore itself sucks down a ton of power, and it limits the scaling you can get from more cores, depending of course on your software.
2) (the key here) clocks/power for each individual core: the largest x86 chips can't run all their cores at the clocks they're capable of, because they'd melt. So talking about 5.x GHz is stupid; you're not going to see that in practice if you're really running all your cores, even if each of those cores is capable of sustaining that speed on its own.

Think about this: Say you have an Intel core that can run at 6GHz, and at that speed has a 20% absolute performance advantage over an AS core running at 3.5GHz. If you could put 64 of them in a chip, which core would you pick? The AS, and it's a no-brainer unless you intend to run the chip at 15% utilization or less, because if you push both chips with as much power as you can without melting them, the AS chip will do much more total work. (Max single-thread is a different story, and we're not talking about that here, of course, as that's not the point of 64-core chips!)
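To put toy numbers on that thought experiment (the power figures below are invented purely to illustrate the trade-off, not measurements):

```python
# Hypothetical per-core figures: the fast core is 20% faster per core but burns
# far more power at its peak clock. Under a fixed package budget, count how many
# cores of each type can run flat out. (Simplification: real chips downclock
# rather than idle cores, but the aggregate-throughput conclusion is the same.)
PACKAGE_BUDGET_W = 350

cores = {
    "6 GHz x86-style core":   {"relative_perf": 1.2, "watts": 25.0},
    "3.5 GHz AS-style core":  {"relative_perf": 1.0, "watts": 6.0},
}

for name, c in cores.items():
    feedable = min(64, int(PACKAGE_BUDGET_W // c["watts"]))
    aggregate = feedable * c["relative_perf"]
    print(f"{name}: {feedable} cores at full clock -> aggregate ~{aggregate:.0f}x")
```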

Obviously Apple must scale up their uncore much better than they did in their M1 Ultra. It's an open question how well they can do this although the scaling in the M2 is a hopeful sign. We should know soon.

It may well be that Apple never builds a really big multicore chip. Perhaps they've decided that Ultra is as big as they need to go, and the higher end of the market isn't worth the effort. I really hope not! But it's possible. However, if they do go there, they have easily the best CPU cores to do it with. GPU's an open question, but I'm optimistic.

So in the end, the answer to "why ARM" which really means "why should Apple build their own cores" is that they've already shown they can do it better than any x86 on the low end (mobile) and on the high end (many-cores, *if* they scale better and if they chose to do it). The current state of play may not look great for desktops but it's just transitional, and drawn out by covid, TSMC, and supply chain woes.
 
One more thing: Everyone posting about how they can't allow expandable memory because of the unified design is just too ignorant to have an opinion worth listening to.

Does that mean Apple will allow expandable memory in their AS Pro? No way to know until it comes out. They might not, because they might think it's not worth chasing the large-memory market, which has costly machines but with low volume. But they certainly *can* do it.

There are a number of options, which include multi-tiered memory. For example, they could have some on-package RAM (say, up to 256GB), and then a larger pool of off-package RAM. Someone was posting earlier about a scheme like this, where the off-package RAM would be used as "swap". No, you wouldn't do that, because that's an ugly and slow technique, and Apple already has better hardware and software to deal with multiple tiers of memory. But it's not so far off the mark.

Intel has already done something similar in their current high-end servers, which support large amounts of "Optane" DIMM "RAM". Of course, it's not exactly RAM - it's halfway between that and flash - but it's similar in that it's a different storage tier, being slower and larger than regular system RAM (and also NV, though that's not relevant here).

Interestingly, they've cancelled that product line and future generations of Xeons won't support that, but the underlying idea is sound. It was just a marketing fail (because it was a cost/benefit fail - that particular tech never reached the potential they talked up for years).

People often oversell Apple's advantage from integrating hardware and software. That wasn't, for example, why the M1 was so good. It was just better hardware, at the time. But in *this* case, it's key. Being able to build the OS and hardware for multitiered memory together (if they choose to do that) is a huge competitive advantage.

Of course, they don't have to go there. They could easily just go with off-package RAM, maybe with extra cache to hide some of the extra latency. Nothing about that says it couldn't be unified. The much harder question is what they do if they support add-in GPUs. Again, there are multiple possible solutions, but I have no idea what they'd choose to do. The next few months will be fascinating!
 
On the other hand, SPEC was developed to run in a Unix environment, so it relies on how well the system libraries (e.g. glibc) have been designed and optimised. It may turn out that WSL's system libraries (which are likely a shim over Windows' kernel APIs) are sub-optimal, but I don't think it will be that bad, since it is mainly just translating one API call to another. I don't think SPEC's components run with constant calls into system APIs.
Anandtech doesn't say which version of WSL it used. WSL 1 translates API calls; WSL 2 virtualizes Linux.

WSL has some odd quirks, with one test not running due to a WSL fixed stack size, but for like-for-like testing is good enough.
That sentence makes me believe that they intended to use those results only to compare AMD and Intel CPUs on Windows, as Anandtech's main audience is gamers.

Huge bandwidth is important to any CPU architecture. The ISA does not affect how much bandwidth is required so much as the CPU microarchitecture, which determines the bandwidth it needs, together with the internal cache fabric (L1/L2/L3). There is a reason why Intel and AMD server CPUs have many more memory channels than their top-end consumer-grade CPUs.
Why do Intel and AMD use lower bandwidth in desktop and notebook CPUs? Costs? Market segmentation?
Can we expect AMD's 7040 series to have similar bandwidth to M2 SoCs?
 
Why do Intel and AMD use lower bandwidth in desktop and notebook CPUs? Costs? Market segmentation?
Can we expect AMD's 7040 series to have similar bandwidth to M2 SoCs?
I'm no industry expert, but my guess would be cost. I would think motherboard manufacturers would balk if AMD and/or Intel told them they had to design and manufacture boards with a 512-bit memory bus (which translates to a minimum of 8 DIMM slots, all of which have to be populated) to get high throughput. I suspect such a board would cost much more than the CPU itself, as laying out all those data lines without crosstalk would be a monumental task. There's a reason server boards are expensive: they have much wider pipes.

As far as I know (which may be outdated), most consumer boards only require DIMMs to be installed in pairs, which translates to a 128-bit data bus.

Edit: As far as I know, the Mx Apple SoCs have the highest memory bandwidth to date of any consumer-grade product. I haven't read the spec of the 7040 series, but I doubt it will have more than a 256-bit data bus. I think most boards will have better memory throughput when matching DIMMs are installed in fours.
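The back-of-the-envelope math behind those bus widths, assuming LPDDR5/DDR5-class transfer rates of 6400 MT/s purely for illustration:

```python
def bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth = bytes per transfer * transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9

print(bandwidth_gb_s(128, 6.4e9))  # ~102 GB/s: typical dual-channel desktop/laptop
print(bandwidth_gb_s(256, 6.4e9))  # ~205 GB/s: M1/M2 Pro-class 256-bit bus
print(bandwidth_gb_s(512, 6.4e9))  # ~410 GB/s: M1/M2 Max-class 512-bit bus
```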
 
Can they pull the plug on Rosetta while they're still selling Intel-based Macs, or would that not matter?
Well, if they don't do it soon, or at least right after WWDC 2023, we'll keep getting apps like WhatsApp and others that never get a native version, because that's just how things are at these big tech companies.
 
Is the huge memory bandwidth that Mx SoCs have more important to the success of Apple Silicon than the use of ARM ISA?

It’s about the implementation, not the ISA. Apple CPUs are fast and energy-efficient because their core design is more advanced than anyone else’s, and because Apple throws a lot of money at them.


sure

but i bet you i can still run well-written software compiled for an i386 on win95, linux or mac in windows 7 (possibly higher), linux or mac. this applies to MIPS as well but since it was never running on MAC, only the linux and nt variants (however long they lasted) would apply. a similar argument may hold for PPC as well but i'm not well-versed on this architecture.

the problem isn't that APPUL chose to move to a new processor.

the problem is that they chose a design that historically has no legacy support and was viewed as "disposable".

the typical arm application was often targeted, cheap and came with a decent c compiler. that's all you needed. the concept of long-term binary support was never on the table.

this is one huge problem that APPUL cannot ignore. PPC did not have this problem. it will come to roost, but by that time the stock hysteria they are desperately trying to prolong will have subsided (wherever that may lead).

everything they're doing now is short term. i am no steve jobs fan (gates has been my hero since 1996), but there is NO WAY he would have chosen to move to ARM just because of the legacy binary incompatibility.

My brain hurts reading this nonsense. ARMv8 is a perfectly stable, mature architecture with state-of-the-art compiler support scaling from the embedded to the server market. It’s the most ubiquitous architecture in the world. Are you sure you are not talking about RISC-V or something?

Apple's not going to make the GPU better for training. However, they might spec up the Neural Engine, and the hype around FP8 has spawned algorithms for training in FP16. It's cooler and sometimes faster for running Stable Diffusion too. I did some rough estimates, and you could run SD nonstop on the ANE for several hours and not run out of battery.

Just a remark about that: the A15/M2 introduced SIMD matmul operations which are identical in scope and function to Nvidia's GPU tensor extensions. Right now these run on the regular SIMD cores on Apple's GPUs, but who knows what they plan to do in the future.
Why do Intel and AMD use lower bandwidth in desktop and notebook CPUs? Costs? Market segmentation?
Can we expect AMD's 7040 series to have similar bandwidth to M2 SoCs?

It’s cost. Wide memory interfaces are very expensive. Consumer CPUs use 128-bit interfaces (like the M1/M2) and so will all have comparable bandwidth when using the same RAM technology.

In fact, wide interfaces are so expensive that GPU manufacturers have been cutting corners there recently. Some time ago flagship GPUs used 512-bit memory buses; now it’s 384 bits for Nvidia and 320 bits for AMD. They’d rather use a narrower interface and hotter RAM, which should already tell you something about the cost distribution. And AMD even split the memory controllers into separate multi-chip modules to cut costs. In the meantime, Apple is shipping laptops with a 512-bit memory interface. It’s insanity.
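Same bus-width arithmetic applied to the GPU examples above (per-pin data rates are approximate and only for illustration):

```python
def gpu_bandwidth_gb_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    # Each pin carries gbps_per_pin gigabits per second; divide by 8 for bytes.
    return bus_width_bits * gbps_per_pin / 8

print(gpu_bandwidth_gb_s(384, 21.0))  # ~1008 GB/s: 384-bit GDDR6X flagship (Nvidia)
print(gpu_bandwidth_gb_s(320, 20.0))  # ~800 GB/s:  320-bit GDDR6 flagship (AMD)
print(gpu_bandwidth_gb_s(512, 6.4))   # ~410 GB/s:  512-bit LPDDR5 (M1/M2 Max)
```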


Edit: As far as I know, the Mx Apple SoCs have the highest memory bandwidth to date of any consumer-grade product. I haven't read the spec of the 7040 series, but I doubt it will have more than a 256-bit data bus. I think most boards will have better memory throughput when matching DIMMs are installed in fours.

If you're talking about CPUs, not GPUs, then yes. With a caveat: an Apple CPU core cluster can only access about 200 GB/s, so it can't use the full bandwidth available on the Max/Ultra. That is still much higher, of course, than any other design on the market, including server-grade CPUs (where available bandwidth per core is 40-50 GB/s at best).
 
If you're talking about CPUs, not GPUs, then yes. With a caveat: an Apple CPU core cluster can only access about 200 GB/s, so it can't use the full bandwidth available on the Max/Ultra. That is still much higher, of course, than any other design on the market, including server-grade CPUs (where available bandwidth per core is 40-50 GB/s at best).
My simple math tells me this is limited by the CPU clock of the M1 Max (at 3.2 GHz). If a CPU core is processing data at 128 bits per clock, it will at most be able to push 51.2 GB/s of data around. So internally, the L1 cache is probably feeding the CPU core 256 to 512 bits of data per clock.

So it looks like the M1 Max's CPU core is better designed than AMD's or Intel's when it comes to removing bottlenecks.
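The arithmetic behind that 51.2 GB/s figure, using the numbers in the post:

```python
clock_hz = 3.2e9        # assumed M1 Max P-core clock from the post
bits_per_clock = 128    # data moved per cycle in the post's assumption

print(clock_hz * bits_per_clock / 8 / 1e9)   # -> 51.2 GB/s per core
```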
 
Why do Intel and AMD use lower bandwidth in desktop and notebook CPUs? Costs? Market segmentation?

In a sense, you're asking the question backwards. It's more instructive to ask why the server parts have a wider memory bus.

Fundamentally, you want memory bandwidth great enough to keep all your cores fed so that they don't stall. The bandwidth each core needs varies from workload to workload, of course, but chip designers have rough guidelines that cover the vast majority of use cases, and they balance that ideal number, multiplied by the number of cores (or threads, discounted by the effective per-thread performance hit from multithreading), against the cost of implementing it in the chip (and on the motherboard, which is a factor in system cost and therefore in the marketability of the chip).

For many years, the balance has worked out to two channels being optimal for consumer-grade and office-grade systems. As CPUs have scaled up in speed and core count, so have memory bus speeds and cache sizes, give or take, which are the two things that need to keep pace. So that hasn't changed in a long time (though some chips, like Intel Atoms, have supported a single channel as well, or instead).

But with server parts you have a lot more cores to keep fed, and a much higher typical rate of system utilization. You need much more bandwidth. Better/bigger caches can only help so much, so you need more memory channels.

As for why AS has high bandwidth, it's due to the need to feed not just CPUs, but GPUs, NPUs, AMX, ISP etc. It's a bonus that when CPU cores are busy and other parts of the chip are not, the cores have more available bandwidth.
 
My simple math tells me this is limited by the CPU clock of the M1 Max (at 3.2 GHz). If a CPU core is processing data at 128 bits per clock, it will at most be able to push 51.2 GB/s of data around. So internally, the L1 cache is probably feeding the CPU core 256 to 512 bits of data per clock.

So it looks like the M1 Max's CPU core is better designed than AMD's or Intel's when it comes to removing bottlenecks.

If I remember correctly, L1D on the M1 is 48 bytes per cycle (which should translate to ~150 GB/s). From benchmarks done by Anandtech it seems that L2 is around 200 GB/s, and the L2 is shared by a cluster of four CPU cores.

Yes, it's a better design overall. Intel and AMD have much higher L1 bandwidth (to drive SIMD throughput operations), but their caches are smaller and the bandwidth to individual cores is significantly lower. They also pay a substantial energy-consumption penalty for having more L1 bandwidth. In Apple's design, the super-wide SIMD (AMX) is instead fed off L2, which does wonders for overall energy efficiency. And sharing L2 between the four cores is also good for non-trivial multicore workloads. It's smarter, more elegant and more economical.
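A quick sanity check of the ~150 GB/s L1 figure quoted above (the clock is approximate):

```python
l1_bytes_per_cycle = 48     # L1D load bandwidth per cycle, as quoted
p_core_clock_hz = 3.2e9     # approximate M1 P-core clock

print(l1_bytes_per_cycle * p_core_clock_hz / 1e9)   # ~153.6 GB/s, i.e. roughly 150 GB/s
```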
 
My simple math tells me this is limited by the CPU clock of the M1 Max (at 3.2 GHz). If a CPU core is processing data at 128 bits per clock, it will at most be able to push 51.2 GB/s of data around. So internally, the L1 cache is probably feeding the CPU core 256 to 512 bits of data per clock.
Anandtech wrote:
We noted how the M1 Max CPUs are not able to fully take advantage of the DRAM bandwidth of the chip

Does a comparison between M1 Pro (256 bit) and M1 Max (512 bit) show the impact of having higher bandwidth?
 
Does a comparison between M1 Pro (256 bit) and M1 Max (512 bit) show the impact of having higher bandwidth?

The extra bandwidth of the Max is just for the GPU. From a CPU-capability standpoint there is no difference between the Pro and the Max. While the Max has a larger SoC cache, you will be hard-pressed to find a problem where this is noticeable.
 
The extra bandwidth of the Max is just for the GPU. From a CPU-capability standpoint there is no difference between the Pro and the Max. While the Max has a larger SoC cache, you will be hard-pressed to find a problem where this is noticeable.
Does this mean that it is useless for AMD and Intel to manufacture CPUs with higher bandwidth unless they include a console-like iGPU?
 
Does this mean that it is useless for AMD and Intel to manufacture CPUs with higher bandwidth unless they include a console-like iGPU?

How did you reach that conclusion? It all depends on what problems you need to solve. For huge datacenter- or server-class CPUs with many dozens of cores, more RAM bandwidth is always useful. The new AMD Genoa chips, for example, have 12 memory channels with up to 460 GB/s of memory bandwidth, but that is shared across 96 CPU cores. A gaming CPU probably doesn't need that kind of bandwidth, because it doesn't deal with that class of problems.

Anyway, this is all explained much better than I ever could in #171
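For reference, the rough arithmetic behind that Genoa figure, assuming DDR5-4800 per the launch specs:

```python
channels = 12               # memory channels on Genoa
bytes_per_transfer = 8      # 64-bit channel -> 8 bytes per transfer
transfers_per_s = 4.8e9     # DDR5-4800

total_gb_s = channels * bytes_per_transfer * transfers_per_s / 1e9
print(total_gb_s)           # ~460.8 GB/s aggregate
print(total_gb_s / 96)      # ~4.8 GB/s fair share per core if all 96 cores pull at once
```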
 
Also, all of this started with very slow CPUs, so initially it was the CPU that was the bottleneck
i don't know about very early architectures, but even my first IBM-compatible PC (80286 @ 8 MHz) had to insert wait states for the RAM.
no idea about my computers before that, but i guess it was the same there too.
can't remember if it had cache memory, but my next one (80386 @ 40 MHz) most certainly did, to somewhat compensate for the way-too-slow RAM. and of course the RAM itself had tons of wait states again, set up in the BIOS. when i set those too low, the system became quicker, but also started to become very glitchy.

of course that was about latency, and not necessarily about bandwidth.

so do Apple Silicon chips still use some additional, more responsive cache memory, or is the system RAM's latency actually low enough this time around that it's not the bottleneck for the CPU/GPU anymore?
 