Somehow I was under the impression that modern operating system kernels presented multiple CPUs as unified resources to their software. Am I thinking of something unique to BeOS?
Well the problem is that they DO present them as unified resources. But in a NUMA configuration they are not.

If CPU0 has 8 memory channels, and CPU1 has 8 memory channels, and CPU0 and CPU1 are connected via some form of interconnect, then the memory of CPU0 is "farther away" from CPU1 than CPU1's own memory. If the OS is not aware of that, it will shuffle threads away from the memory where their data is located.

For CPUs with multiple cores this isn't a problem, since all cores connect to the same memory subsystem and hence see uniform memory latency. You can even get away with it if you put two CPUs/SoCs on a package and link them with a high-bandwidth, low-latency interconnect (as M1 Ultra does). You can also alleviate this by going the EPYC route and putting the memory controller on a separate die that all CPU dies connect to, again giving uniform memory latency.

But as soon as you have multiple sockets, and hence a physical interconnect several centimeters long, the link between sockets adds so much latency that a thread running on one CPU whose data sits in memory belonging to the other CPU will perform significantly worse.

This is the reason an OS has to be NUMA (Non-Uniform Memory Access) aware to properly manage threads and memory allocation and reduce this kind of mismanagement. Linux, as an OS predominantly used in servers, which are virtually the only machines with multiple sockets, is very good at managing this. Windows, for example, is not. And macOS ... who knows.
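To make that concrete, here is roughly what explicit, NUMA-aware placement looks like from user space on Linux with libnuma (macOS exposes no comparable API). This is only a minimal sketch: the node choice, buffer size, and file name are my own, and error handling is pared down.

```c
/* numa_local.c - keep a worker's data on the same NUMA node it runs on.
 * Minimal Linux/libnuma sketch; build with: gcc numa_local.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this system\n");
        return 1;
    }

    int node = numa_max_node();          /* pick the last node as an example */
    size_t len = 256UL * 1024 * 1024;    /* 256 MiB working set */

    /* Pin the calling thread to that node, then allocate memory on the same
     * node, so every access stays on the local memory controller. */
    numa_run_on_node(node);
    char *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0, len);                 /* touch the pages: they are now local */
    printf("Working set placed and touched on node %d\n", node);

    numa_free(buf, len);
    return 0;
}
```

A NUMA-aware kernel does essentially that bookkeeping automatically, trying to keep a thread running on the node that holds the pages it touches.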
 
They would still need to release 2 separate products: the new Mac Pro, that is 100% pure Apple Silicon
You see... I guess the problem with that is: does M1 Ultra even have the required I/O for this? A Xeon has 44 PCIe 3.0 lanes. That's roughly 352 gigabits of bandwidth. And even that isn't "much" if you compare it with EPYC CPUs that have 128 PCIe 4.0 lanes, totaling roughly 2 terabits. (That's, by the way, the reason NVIDIA runs EPYC CPUs in its render boxes.)

M1 Ultra has six TB4 connections; that's a total of roughly 240 gigabits. If we now assume that an AS Mac Pro would offer something like ... Thunderbolt-on-PCB, or even drive PCIe controllers via Thunderbolt (so, basically a built-in PCIe enclosure), that would reduce the available Thunderbolt ports drastically. M1 Max has a very fast interconnect that I think they could wire up to a PCIe master, but running as M1 Ultra that interconnect is already in use.
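The back-of-the-envelope math behind those figures, using nominal line rates (PCIe 3.0 ≈ 8 Gbit/s per lane after encoding, PCIe 4.0 ≈ 16 Gbit/s per lane, Thunderbolt 4 = 40 Gbit/s per port) - a throwaway sketch, not a benchmark:

```c
/* io_budget.c - back-of-the-envelope I/O bandwidth comparison.
 * Per-lane/per-port figures are nominal line rates, ignoring protocol overhead.
 */
#include <stdio.h>

int main(void)
{
    const double pcie3_lane = 8.0;   /* Gbit/s per PCIe 3.0 lane (~7.9 after 128b/130b) */
    const double pcie4_lane = 16.0;  /* Gbit/s per PCIe 4.0 lane */
    const double tb4_port   = 40.0;  /* Gbit/s per Thunderbolt 4 port */

    printf("Xeon,  44 x PCIe 3.0 : %6.0f Gbit/s\n", 44 * pcie3_lane);
    printf("EPYC, 128 x PCIe 4.0 : %6.0f Gbit/s\n", 128 * pcie4_lane);
    printf("M1 Ultra, 6 x TB4    : %6.0f Gbit/s\n", 6 * tb4_port);
    return 0;
}
```

Which is why 240 Gbit/s of Thunderbolt looks thin next to even the Xeon's PCIe budget, let alone EPYC's.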

So unless M1 Max/Ultra has A LOT of additional bandwidth somewhere that it isn't even using right now ... an Apple Silicon Mac Pro would probably end up severely bandwidth-starved, considering it would also drive potentially multiple high-bandwidth displays via Thunderbolt which, to make matters even worse, also exposes high-speed USB.

So actually.... maybe the 7.1 MP with Apple Silicon GPUs is the best they can currently do. And frankly: for a real rendering workstation it wouldn't even be a bad thing.
 
I completely forgot the base model was 256.

I hope people take a more sober look at this machine now that two years have passed. I recall watching YouTubers make videos about how they're selling their $15,000.00 Mac Pro and using an M1 iMac. It's like dude, if you can do that, this wasn't the machine for you in the 1st place.
You have to remember that a lot of them are purchasing these machines to review them. People like MKBHD and iJustine actually use theirs. MKBHD's studio is running 100% on 2019 Mac Pros, they have like 12 of them, and Justine won't let hers out of her sight; she works on it as her daily driver...but a lot of them just pick them up for review videos and then return them.

Also, buying these mid-tier Mac Pros makes no sense, and THOSE people should 100% be looking to pick up the Mac Studio. That's who it's targeted at. But if you've got a $25k-tier machine or higher, you're probably doing something with it that can't be done on anything less.
 
Hmmm, I wouldn't be upset at it...but it's really going to come down to one bottom line. Apple is incredible with optimization, and that's part of why we pay the Apple tax; however, certain software REALLY needs to be developed specifically for the nature of that kind of beast, or it's still a NO GO for folks like me.

My Mac Pro as currently configured runs like a Puget workstation with 3 RTX 3090s...That literally means any system I replace this with has to be, at BASELINE, as fast in OCTANE RENDER and REDSHIFT as 3 RTX 3090s. That is literally the ONLY point I'll be focused on. I dock my M1 Max MacBook Pro and feed it into my monitors and home server when I'm editing these days, so my editing needs are covered. I also ordered the Mac Studio for the music studio upstairs...which, quite frankly, is where I edit anyway, so the Mac Studio will likely become my permanent music box as well as my permanent editing bay. So music and edits are covered.

Everything Adobe will be covered, and fastest, on my Mac Studio, and EVERYTHING GPU-RENDER-BASED IS DONE ONLY ON MY 7.1 Mac Pro. Which means whatever happens in December ultimately only has a chance of replacing one piece of hardware in my studio, and that's the beast. If it can't BASELINE the Beast, then it isn't what I currently need it to be.
 
The fact that a "Pro" product doesn't start at 1TB is just stupid. Almost as stupid as 64GB of memory on a $600 iPad.
 
Thank you for that explanation. It's greatly appreciated.

Since Mac OS is a Unix... should it not have the same ability?
 
This was a lot to unwind. I feel you, we all have to go through the desert at one time.

Might I suggest that, as you have been able to "bear" Windows for gaming, you might also be able to rev up Blender there and do it all the same.

Blender is almost an "OS" in its own right, and while I may be too much of an amateur to know the differences introduced by distributed computing as described in your iMac debacle, I HAD to buy a Windows/Linux machine. OptiX simply is that good. It's financially irresponsible to buy Mac hardware for this use case unless you do it for fiscal reasons.

I don't interact with the device, simply remoting into it with Microsoft Remote Desktop.

EDIT: I didn't read about your need for Logic, and I have no expertise on that software's hardware needs, but if you can get away with the base M1 mini, it could serve as an endpoint to access your despicable WinMachine (some cheap gaming rig with an RTX card, once the GPU market cools down), and you could be golden for a while.

You don't need a high-tier GPU to work with Blender professionally; anything with 8 gigs of memory and RTX will serve you well so long as you don't aspire to photorealism. And even then, clever shaders often outshine the heavy textures that would saturate your limited VRAM. And look it up: GET A GPU THAT ENABLES OPTIX, night-and-day QoL for the viewport.
This is an interesting suggestion. My resistance: Remoting into a Windows machine is still using it, even if it's just to primarily use one particular piece of software with its own non-standard UI elements. I'm sure I would still have to be interacting with Explorer, etc, and trackpad behavior would probably also be very different.

The image compression and latency, I have found, are also annoying (I occasionally have to restart a remote session between my two Macs; I can only imagine how it would be with Windows on the other end).

Come to that, I assume using a SpaceNavigator through a remote session would be impossible (though, I wouldn't be surprised to find that my SpaceNavigator is also abandoned for M1 Macs; I ought to check on that at some point).

Worth testing. I'm not sure I can even test that on my existing Windows 10 PC and iMac with existing software. Hm. You've got me curious...
 
It's not something that magically appears just because you have a Unix system. You have to have the code in the kernel, and it has to work for your hardware. And such code isn't trivial; however, it's not impossible to write and get tested in a few weeks. I'm rather confident that Apple could introduce this rather quickly if they really wanted to go that route.

I'm, however, very skeptical they would do that. As I outlined above, there is no obvious way they could connect two or even more M1 Ultras. The only feasible option I see is if they can somehow shove a high-bandwidth interconnect between the two M1 Maxes - or if something like that already exists on M1 Max and we just don't know about it.
 
You can remote into any Windows 10 or 11 Pro desktop via Microsoft's RDP tool, which you can download from the App Store for free (Home editions can't act as an RDP host). Performance isn't even totally terrible (assuming both machines are local and on copper).

If you need a low-latency environment and can compromise on image quality, try Moonlight if your Windows PC has a recent NVIDIA GPU. You can also get this to work with AMD GPUs, but the result is notably worse and quite a hassle to accomplish.
 
I’d love to know how much of the SoC is being utilized at any given moment.

The RAM, CPU, and GPU cores get used just by starting more software that actively processes something, especially from anything doing graphics.

But what about the specialist cores? Does Mac OS utilize them for anything, or are they sitting there doing nothing most of the time, without third-party software written to use them? Is it like owning an Afterburner card in a Mac Pro but never doing video production?
I’d imagine that the Secure Enclave and the Neural Engine get plenty of use from background processes. I wouldn’t at all be surprised if Spotlight indexing uses the neural engine, and, obviously, Photos is doing stuff with it in the background (and the system is using it for task scheduling, Siri Suggestions and stuff like that).
 
What’s the point of this anymore
RAM (upgradeable), storage (upgradeable), Intel, PCIe slots (some people/environments need special cards). Our networking scheme uses 20 Gb/s optical fiber back to a video server with 750 TB of storage, plus a host card for an I/O audio/video breakout box. Overall expandability: a Sonnettech HD bracket adds 4 SSDs internally, etc. Add Microsoft-native issues, and the Mac Studio is just not at that level yet.
 
That's true, this is a workstation not a Facebook machine. Professionals will make the money back with one job!
That is very true, as people are not getting that there is a difference between a "desktop" computer and a "workstation"-grade machine. The other one in this realm is the HP Z8 ... maxes out at a whopping $125K.
 
I'm glad I'm not the only one who feels that way about Relay. Listening to the excuses those guys give for their purchase of every new computer makes you laugh out loud.

Incredible, really ... it honestly chafes me a bit how cavalierly so many in the Apple podcast space constantly come up with new excuses to waste money on upgrades they have zero use for.

Must be nice, I guess. It just rubs me the wrong way to see so much waste and such a lack of a sense of what the Apple experience is like for normal budget users (expensive, restrictive, odd at times, etc).
 
Why are we assuming an AS Mac Pro will be upgradeable? More powerful sure, but upgradeable? It's all a system on a chip with RAM and everything on one piece.

Let me ask Mac Pro users this---would you want to give up the unified memory architecture in the AS machines?

Basically, if you want unified memory (and I think professionals would for video memory), you're going to have to be prepared for a machine that is either not possible to upgrade or a machine with minimal upgrades.
Nope. "Video pros" at the higher level will want more than that: 2 W6800X Duo cards with 64GB of RAM each, plus the Afterburner card ... add 1.5TB of system RAM capability, and no SoC is going to match that. And I/we must have PCIe card slots for optical network cards, host cards for I/O, etc. This will not compete with "workstation"-class machines. If Apple wants to get out of that market, then so be it. There are always Windows machines for the high-end stuff ... the HP Z8 line.
 
Thank you for this info.

No, my “newest” PC is from 2006-2008 (an EVGA-branded GPU with an NVIDIA GeForce 8800 GTX).
 
It's very telling that Apple is releasing a new desktop with an M1 Ultra option in Mac Studio that's meant to replace the 27" iMac. They don't really have to do that since the 27" iMac was never meant to be a Pro level machine like Mac Pro. Yet Mac Studio with M1 Ultra can now potentially compete with Mac Pro in the lower end of the high-end Pro market.

This means two things:
  1. Either Apple has much more powerful Apple Silicon chips still in the pipeline, i.e., M2, that they will put in the new Mac Pro (thus completing the transition by their self-imposed deadline in Nov 2022).
  2. Or they've decided to keep Intel-based Mac Pro in their line of business till 2023 as, just like you said, "an insurance policy" for certain customers.
I just don't see them releasing a new Apple Silicon-based Mac Pro where the base model is only slightly faster than the highest-end Mac Studio.

It's probably safe to say that the gap between Intel and Apple Silicon narrows the higher end you go because power consumption becomes less of an issue. If M1 Ultra is already leaps and bounds ahead of its Intel counterpart, Mac Pro's chip has to be at least 60% faster than M1 Ultra for Apple to deem it worthwhile to introduce a new machine.

Most likely they'll keep Intel-based Mac Pro, as an option alongside an Apple Silicon Mac Pro at least, well into 2023.
I simply can't see them getting rid of the Intel-based Mac Pro until they can beat it completely. As I've said elsewhere, my 28-core 7.1 with 2 W6800X Duo GPUs is equivalent to 3 RTX 3090s. So in order for an M-based Mac Pro to be released, for people like me, its BASELINE has to be more powerful than 3 RTX 3090s, or there is literally zero reason to upgrade. My M1 Max MacBook Pro and soon-to-arrive Mac Studio take care of literally every other need I have...the only thing an M-based Mac Pro could offer me is fast render times for my 3D animation, simulations, and VFX...
 

Didn’t Mac OS X have this ability with the Mach kernel on the PowerPC Xserves?
 
I did a bit of digging regarding this, and found this on Stack Exchange:

No Mac Pro supports NUMA; their memory is configured in interleave mode.
Highly memory-intensive applications faced with incorrectly allocated (cross-NUMA-node) memory pages can still see significant (>20%) performance issues despite "modern processor architecture" (I'm an HPC admin and regularly benchmark / troubleshoot system performance issues).
And someone else has installed a NUMA-aware OS (read: Linux) onto a Mac Pro and found only a single NUMA node, which shows that the firmware configures the memory to run in interleave mode (a non-NUMA configuration). While this is slower than a properly configured NUMA system, it is still faster than an incorrectly configured / allocated NUMA system.
With this, we may reason that the Mach/XNU kernel is likely not NUMA-aware.
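For anyone who wants to reproduce that single-node observation, here is a small sketch using Linux's libnuma (the file name is mine, and it naturally has to run under a NUMA-aware OS, not macOS); a firmware-interleaved machine should report exactly one node:

```c
/* numa_topo.c - report how many NUMA nodes the firmware exposes and
 * the relative access distances between them (10 == local).
 * Build with: gcc numa_topo.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("Kernel reports no NUMA support\n");
        return 0;
    }

    int nodes = numa_num_configured_nodes();
    printf("Configured NUMA nodes: %d%s\n", nodes,
           nodes == 1 ? " (memory presented as a single, interleaved pool)" : "");

    /* Print the distance matrix the firmware advertises. */
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```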

The thing with the old Westmere Xeons in the original Mac Pro (and the slightly older ones used in the Xserves) is that they do not have an integrated memory controller the way more recent CPUs, including Apple's SoCs, do, but rely on an external one (which some may remember being called the northbridge). This made dual-socket systems quite a bit easier, since two CPUs could "share" a single northbridge and do what the article above calls "interleave mode": basically both CPUs using all the memory, but in alternating cycles, hence having effectively half the bandwidth and double the latency. But this is still faster than having to shove wrongly assigned memory between CPUs.

Now most CPUs have their own memory controllers on silicon - with the curious exception of the Zen 2 and 3 AMD chips, which re-introduced a shared northbridge, just "on-package", so the CPU appears to the outside as a uniform node. That's the reason the OS had to be properly NUMA-aware to efficiently run a TR 2970X or TR 2990X, but not the 3000 series. However, even that doesn't come without cost. Memory latency for Zen 2 and Zen 3 is notably worse than in other, monolithic designs (which is why roughly 50% of a Zen 3 die is cache), so it makes very good sense to have a monolithic design IF you can manufacture the chips you need at the right cost/yield - an area where Apple has more leeway, since they can sell their chips for more money than AMD can. (I guess it also helps that N5 has spectacular yields.)

But since M1 Max (and hence Ultra) comes with its own memory controllers, for the OS to drive more than one socket of these it would have to be NUMA-aware, since interleaving across two sockets would be far too slow - especially considering how reliant the M1 is on low-latency, high-bandwidth memory.

Building efficient multi-socket systems is very hard to do. If it were easy, the current norm would be to have several CPUs even in consumer hardware. It is much easier and much cheaper to design and manufacture a small chip than a big one. AMD's approach with Zen 2/3 EPYC is probably the best we've seen here, since it basically is "multi-socket-on-a-package", and in my opinion Apple would have to do something similar to wire up more than two chips.

Although: adding NUMA awareness at the kernel level really isn't impossible, and considering the tremendous resources Apple has at its disposal, not even "hard" to accomplish in a few weeks. The actual challenge lies in the fact that, to make the best use of it, applications also have to be aware that the underlying architecture is not uniform.
 
I do think the stars are pointing to no new Intel Mac Pro at this point, which deeply saddens me. I really wanted one more, and at this point I think buying a 2019 used doesn't make sense. I'm not going to pay the absurd prices for workstation AMD cards that are nearly a generation old, especially since stock is improving for the desktop ones that perform nearly identically minus the Thunderbolt passthrough, etc.

Ordered a maxed-out Mac Studio for the time being to replace my 16" i9 that is way, way too hot and loud. If I find a lot of incompatibility or other issues, I'm going to suck it up and Hackintosh for a year until the situation improves, but I really don't want to do that. I just want things to work reliably, quietly, and fairly quickly.
 
They're doing some weird interposer magic with the M1 Ultra, at least for the GPU, to make it addressable as a whole, but it would be interesting to see some specific 3D benchmarks that push slightly past the 32 or 64 GB threshold (per Max die) to see the performance hit of going across the "UltraFusion" interconnect, which I'm sure is small but probably measurable.
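Something like the classic pointer-chase microbenchmark, swept across growing working-set sizes, is the kind of test that would expose such a cliff. A rough sketch (the default size, hop count, and file name are arbitrary choices of mine; macOS offers no public way to pin pages to a particular die, so this only shows the aggregate effect):

```c
/* ptr_chase.c - crude memory-latency probe: chase a random cycle of
 * pointers through a working set and report nanoseconds per hop.
 * Growing the working set past what one die's memory slice can serve is
 * where a cross-interconnect penalty, if any, would show up.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    size_t len = (argc > 1) ? strtoull(argv[1], NULL, 0) : (1UL << 30); /* bytes */
    size_t n = len / sizeof(size_t);
    size_t *a = malloc(n * sizeof(size_t));
    if (!a) { perror("malloc"); return 1; }

    /* Build one random cycle (Sattolo's algorithm) so hardware prefetchers
     * can't guess the next address. */
    for (size_t i = 0; i < n; i++) a[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(rand() % i);
        size_t t = a[i]; a[i] = a[j]; a[j] = t;
    }

    struct timespec t0, t1;
    size_t hops = 50 * 1000 * 1000, p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) p = a[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%zu MiB working set: %.1f ns per access (p=%zu)\n",
           len >> 20, ns / hops, p);
    free(a);
    return 0;
}
```

On a 128 GB Ultra one could sweep it at, say, 16, 48, and 96 GiB and look for a step in the ns-per-access curve; whether such a step exists is exactly the open question.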

I haven't seen this yet, but hopefully AnandTech will do a test like this, because it will give us a good idea of what to expect from the M1 Mac Pro (if it exists). If there's a decent amount of latency introduced, it might be worth waiting for the M2 version, when they refine things a bit. This interconnect technology is from 2017 or so, and while it's cutting edge in that they're the first to use it, it isn't cutting edge as far as what TSMC is capable of for future products.

The die stacking and vertical/'3D' interconnects will be very interesting and could truly reduce latency tremendously just due to the very short trace length - see HBM2/3.

Also, needing to address the Neural Engines discretely in the Ultra for ML makes it seem a little unfinished, like a compromise product. But the achievement with the GPU really should not be understated; it is huge, especially for a first-gen architecture.

I really want an Ice Lake (or better) Mac Pro to hold me over for 3-5 years while they sort all this out and get things 100% compatible, but it seems unlikely at this point since the last Intel Mac came out in August 2020.
 