> It's not that it's naive so much as that the devil is in the details. For example, it's easy to say that the interconnect joins the chip networks, but *how*, physically, does it do that? (It's not like it's an Ethernet where you can just throw a switch in between the two... though that is one conceptual approach you can take.)
It's true that a NoC (network-on-chip) is quite different from Ethernet. However, general networking concepts still apply. You can just throw in something broadly similar to an Ethernet switch. Even a single die couldn't function at all without switches.
For M1 Ultra, Apple's die-to-die interconnect solution appears to be simple brute force. They claim that ~10,000 interconnect signals provide 2.5 TB/s of bandwidth: 2.5e12 bytes/s * 8 bits/byte = 20e12 bits/s, and 20e12 b/s / 10e3 wires = 2e9 bits/s per wire. That ~2 GHz per-wire rate, plus the extremely short reach of the interconnect, suggests that Apple didn't bother with SERDES. Instead, they're taking advantage of having so many signals by just running their on-die NoC protocol through the wires connecting the two die, probably at the standard on-die NoC frequency.
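A quick sanity check of that arithmetic (the ~10,000 signal count and the 2.5 TB/s figure are Apple's claims; treating each signal as a single-data-rate wire is my assumption):

```python
# Back-of-the-envelope check of the per-wire rate implied by
# Apple's claims (~10,000 signals, 2.5 TB/s aggregate bandwidth).
bandwidth_bytes_per_s = 2.5e12      # Apple's claimed 2.5 TB/s
signals = 10_000                    # Apple's claimed ~10K wires

bandwidth_bits_per_s = bandwidth_bytes_per_s * 8   # 20 Tb/s
per_wire_bps = bandwidth_bits_per_s / signals      # 2 Gb/s per wire

print(f"{per_wire_bps / 1e9:.1f} Gb/s per wire")   # -> 2.0
# At single data rate, 2 Gb/s per wire is roughly a 2 GHz clock --
# plausible as an on-die NoC frequency, no SERDES required.
```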
If there's really no frequency change, it may be as simple as a flop just before each off-die transmitter pin and a flop just after each off-die receiver pin, adding two cycles of latency when crossing die. Even if it's slightly more complex than that, this is how they were able to claim that software can treat M1 Ultra as if it's just one giant SoC. Technically it's a NUMA system, but the non-uniformity is so mild that software can safely ignore it.
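To make the two-cycle claim concrete, here's a toy cycle model of that launch-flop/capture-flop crossing. This is purely my illustration of the idea, not Apple's design; it assumes both die share one NoC clock:

```python
# Toy model of a synchronous die crossing: one launch flop at the
# TX pin, one capture flop at the RX pin, same clock on both die.
def cycles_to_cross(payload, max_cycles=5):
    tx_flop = None   # flop just before the off-die transmitter pin
    rx_flop = None   # flop just after the off-die receiver pin
    for cycle in range(max_cycles):
        received = rx_flop   # what the far die's NoC sees this cycle
        # Both flops update together on the clock edge; the tuple
        # assignment models that simultaneous update.
        rx_flop, tx_flop = tx_flop, (payload if cycle == 0 else None)
        if received is not None:
            return cycle
    return None

print(cycles_to_cross("flit"))  # -> 2: two extra cycles to cross die
```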
> And once you've solved that problem for an Ultra, two back-to-back Max chips, does that solution scale at all? It does not, not without at least some additional work, because you can't lay out more than two chips that all have direct links of that scale to each other. Thus is born the "I/O die"... maybe.
Right, if you need longer reach (more physical distance between die) and/or more die-to-die links, you'll eventually be forced to go with SERDES to get more bandwidth out of each off-chip wire. SERDES always adds latency. There are well-known ways of mitigating this; for an example, look up what Intel has disclosed about its CSI/QPI/UPI interconnect for multi-socket systems. Intel has to use SERDES there because its links are off-package, and potentially even off-board, not just off-die: there are a few orders of magnitude fewer signals than 10K, several orders of magnitude more distance to cover, and connectors in the path. They also have to support all kinds of other machinery needed to scale up to hundreds or thousands of nodes.
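The wire-count math shows why SERDES wins once the reach grows. The 2.5 TB/s target is carried over from above; the 16 and 32 Gb/s lane rates are my illustrative assumptions, roughly the class of rates used by QPI/UPI- and PCIe-era links:

```python
# Wires needed to hit 2.5 TB/s at different per-wire signaling rates.
# 2 Gb/s matches the no-SERDES estimate above; 16 and 32 Gb/s are
# assumed SERDES lane rates, not figures from Apple or Intel.
target_bps = 2.5e12 * 8                       # 20 Tb/s aggregate

for rate_gbps in (2, 16, 32):
    wires = target_bps / (rate_gbps * 1e9)
    print(f"{rate_gbps:>2} Gb/s per wire -> {wires:6,.0f} wires")
# 2 Gb/s  -> 10,000 wires: only practical die-to-die on a package.
# 32 Gb/s ->    625 lanes: practical off-package, but serialization,
# encoding, and clock recovery all add latency on every hop.
```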
What interests me is how big a system-on-package Apple can build with the extremely short reach edge-to-edge links they've demonstrated with M1 Ultra. You can't build an arbitrarily big system this way; it can't possibly scale out to huge systems the way Intel's interconnect does, but Apple doesn't need that. They only need enough scaling to cover the existing Mac workstation market.
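Here's a rough way to see the ceiling, assuming you keep every die directly linked to every other (to preserve the mild-NUMA property) and budget at most one ultra-wide link per die edge. Both assumptions are mine, not anything Apple has disclosed:

```python
# A fully connected package of n die needs n-1 ultra-wide links per
# die, but a die has only 4 edges, and some edge "beachfront" must
# still go to DRAM and other I/O. Assumed budget: one link per edge.
EDGES_PER_DIE = 4

for n in range(2, 7):
    links_per_die = n - 1
    total_links = n * (n - 1) // 2
    fits = "fits" if links_per_die < EDGES_PER_DIE else "out of edges"
    print(f"{n} die: {links_per_die} links/die, "
          f"{total_links} total -- {fits}")
# 2 die (M1 Ultra) uses one edge per die; by 5 die, every edge is a
# die-to-die link and there's no beachfront left for memory.
```

Past that point you either accept multi-hop NUMA, or you add a switch die in the middle, which is exactly the "I/O die" idea from the question above.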