Apple's M1 Ultra Chip: Everything You Need to Know

Piggie · Apr 2, 2022

I need to ask what some may consider a very stupid question, even though to me it's a valid question to ask, but my total lack of knowledge about the way software talks to hardware makes me feel it's a good question to ask.

As I understand it, you can buy the lowest spec cheapest M1 Apple Product, and the most expensive $4000+ M1 Ultra product, and for "single core" applications, of which there are loads and loads, they will both perform exactly the same, which I'm sure there are many who even now don't realise this fact.

So my question is this:

Why is it not possible for the latest hardware to fool software, and the hardware itself spread the command sent to it from software across multiple cores, and then send back the result.

A bit like me saying I can give my order to one waitress (as I can only speak to one person at a time)
But without me knowing, she then tells a team of five people to make my order, once complete, they give the order back to her who gives it to me.
So, to me (the software?) I think it's all done by her, but in reality, it's been spread across multiple people without me being aware of it.

Can anyone explain in simple? terms why this is not done?

Thanks.

goborgh · Apr 2, 2022

"...it offers double the performance of the M1 Max."

Except it doesn't. All benchmarks are indicating that at best it can do 50% increase over the Max.

Puonti · Apr 2, 2022

Piggie said:
A bit like me saying I can give my order to one waitress (as I can only speak to one person at a time)
But without me knowing, she then tells a team of five people to make my order, once complete, they give the order back to her who gives it to me.
So, to me (the software?) I think it's all done by her, but in reality, it's been spread across multiple people without me being aware of it.

Can anyone explain in simple? terms why this is not done?

Please note that I could be way off - I'm not a processor engineer. That said, here's how I've understood it, using your analogy (I'm also curious to hear someone who actually knows what they're talking about explain it):

What you ask for from the waiter:

Veggie burger w/ garlic 2x patty hold tomato
Big fries w/ chili dip
Big cola hold ice

What the waiter gives to the chef:

2a1a7ffb0dfba41a217f39d134050bd7

The chef tries to split the work between the kitchen hands:

2a1a7ffb
0dfba41a
217f39d1
34050bd7

The kitchen hands have no idea what you ordered because split up like that, the order makes no sense. What you receive in the end might have some familiar-looking pieces, but it's not what you ordered.

Also the kitchen might now be on fire.

SidricTheViking · Apr 2, 2022

LinusR said:
Very nice article! There’s a typo:

Firstly, not really a typo, but I think it’s worth mentioning that the M1 Ultra is Mac Studio only, so impacting battery life doesn’t matter at this stage. I guess the second one is “M1 Pro”, not “M1 Ultra”.

I guess the second one is M1 Max, not Pro….

LV426 · Apr 2, 2022

Piggie said:
I need to ask what some may consider a very stupid question, even though to me it's a valid question to ask, but my total lack of knowledge about the way software talks to hardware makes me feel it's a good question to ask.

As I understand it, you can buy the lowest spec cheapest M1 Apple Product, and the most expensive $4000+ M1 Ultra product, and for "single core" applications, of which there are loads and loads, they will both perform exactly the same, which I'm sure there are many who even now don't realise this fact.

So my question is this:

Why is it not possible for the latest hardware to fool software, and the hardware itself spread the command sent to it from software across multiple cores, and then send back the result.

A bit like me saying I can give my order to one waitress (as I can only speak to one person at a time)
But without me knowing, she then tells a team of five people to make my order, once complete, they give the order back to her who gives it to me.
So, to me (the software?) I think it's all done by her, but in reality, it's been spread across multiple people without me being aware of it.

Can anyone explain in simple? terms why this is not done?

Thanks.

Multi-threaded applications tend to be quite complicated, requiring special programming techniques to run efficiently. If a programmer can identify sections of an application that can run in parallel with other sections he will write code to spread the load, and the CPU will duly oblige by running tasks across the multiple cores that are available.

This is hard to do, and it would be a real stretch to expect hardware to do the work of an experienced programmer by trying to create and delegate sub-tasks automatically.

Also, some algorithms used in code cannot be parcelled off to other processes.

It‘s really a job for the programmer to design the way that it runs tasks in parallel.

sudo-sandwich · Apr 2, 2022

LinusR said:
Very nice article! There’s a typo:

Firstly, not really a typo, but I think it’s worth mentioning that the M1 Ultra is Mac Studio only, so impacting battery life doesn’t matter at this stage. I guess the second one is “M1 Pro”, not “M1 Ultra”.

Could be on UPS.

willyx · Apr 2, 2022

It's fast

It's expensive

The end.

sudo-sandwich · Apr 2, 2022

Piggie said:
As I understand it, you can buy the lowest spec cheapest M1 Apple Product, and the most expensive $4000+ M1 Ultra product, and for "single core" applications, of which there are loads and loads, they will both perform exactly the same, which I'm sure there are many who even now don't realise this fact.

The base M1's single core is a little slower than an M1 Pro's. Very fast anyway.

Piggie said:
So my question is this:

Why is it not possible for the latest hardware to fool software, and the hardware itself spread the command sent to it from software across multiple cores, and then send back the result.

A bit like me saying I can give my order to one waitress (as I can only speak to one person at a time)
But without me knowing, she then tells a team of five people to make my order, once complete, they give the order back to her who gives it to me.
So, to me (the software?) I think it's all done by her, but in reality, it's been spread across multiple people without me being aware of it.

Can anyone explain in simple? terms why this is not done?

Thanks.

Say you ask for mashed potatoes and steak. It's obvious how to parallelize this with two cooks. One cook makes the potatoes, the other makes the steak. But if you try adding a third cook so two are working on the steak, that won't speed things up because only one can be moving the steak around at a time. They'll have to take turns, leaving one idle. They may even fight over the tongs. As the saying goes, too many cooks in the kitchen.

This is truer to computing than it sounds, even the fighting part (see lock contention). I hesitate to even call it an analogy because computers and software are really designed based on the human understanding of carrying out tasks.

It's possible that we'll have AI-generated code in the future that can assist in parallelizing things. But fundamentally, some tasks cannot scale to multiple CPU cores (or GPU, TPU, RAM, disks, whatever computing resource), or they can only scale to a certain number effectively.

chucker23n1 · Apr 2, 2022

Piggie said:
for "single core" applications, of which there are loads and loads, they will both perform exactly the same,

First of all: yes and no.

Suppose there were an M1 with just a single Firestorm core and no Icestorm cores. An M1 Basic, if you will. And now you run your app. It’s true: it would be the same exact core, running at the same clock, as one of the 16 Firestorm cores in the M1 Ultra.

But that scenario isn’t quite realistic, because:

1) you always have dozens of background processes. You probably have other GUI apps running that do processing. And then there’s system stuff running, too. As soon as you have a second core, that leaves more for the foreground!

2) even your foreground app likely isn’t fully single-threaded. For example, it might use one thread to render the UI and listen to events (mouse clicks, key presses, other stuff such as notifications), and another for a computation task.

So already, two cores is definitely beneficial. However, there are diminishing returns. Do four cores still help? Probably. Ten? Eh, maybe. Twenty, like on the Ultra? Only for some people. This is also part of why some of those cores are Icestorm instead: for the majority of people, there’s no point in having that many performance cores. It’d be a waste of energy.

Piggie said:
Why is it not possible for the latest hardware to fool software, and the hardware itself spread the command sent to it from software across multiple cores, and then send back the result.

Some stuff can be parallelized, but a lot cannot.

Piggie said:
A bit like me saying I can give my order to one waitress (as I can only speak to one person at a time)
But without me knowing, she then tells a team of five people to make my order, once complete, they give the order back to her who gives it to me.
So, to me (the software?) I think it's all done by her, but in reality, it's been spread across multiple people without me being aware of it.

Yes, but that’s it right there: it wouldn’t make sense to parallelized what the waitress does. Can five waitresses take up orders from five customers faster than one? Yes. But can five waitresses take them faster from just one customer? Probably not. Maybe one could be an expert on the drinks, and another on the foods, but at the end of the day, there’s only one customer to talk to. So, in this scenario, the customer becomes a bottleneck. You could throw fifty cores at the task and it wouldn’t help.

But then we get to cooking. Indeed, a lot of this can get parallelized. Cook A cuts peppers, cook B prepared meat, cook C looks for seasoning. But again, you can’t spread the food to multiple ovens. It’s one piece of food.

And finally, only one person can bring the finished plate to the customer.

Let’s bring that back to tech.

Suppose you want to render an image. This is actually a task that can be parallelized very well. If you have two cores, one core could do the top half, and the other could do the bottom half. If you have sixteen, you split the image into sixteen pieces.

But a lot of tasks don’t work like that at all. Instead, they have dependencies and bottlenecks. Perhaps you need to load a file. Loading it in two threads won’t make that go faster; there’s only one disk, and it’s already optimized to fetching data as fast as it can. Instead, it’ll in fact slow the disk down, as it’ll now keep switching between one task (say, loading the first half) and the other. Perhaps you need to fetch data from a web service. That’ll take as long as it takes. A faster network connection might help; more CPU definitely will not.

And I haven’t even gotten into sync issues. So the cook who cut the peppers is done, but the other cook who was preparing the meat didn’t wait for the sauce to be finished, because the code doesn’t check such prerequisites safely. Instead, the meat is already cooking, without the sauce. Oops. The food isn’t quite ruined, but it won’t taste great.

Threads (and more abstractly, tasks) need to carefully check that other tasks are done before they can proceed, and doing so can create a lot more overheard than you would have if you dimply ran the tasks sequentially on a single core.

chucker23n1 · Apr 2, 2022

Realityck said:
Look a 128K Mac was $2499 and that was the low end.

Well, yes and no. It was the only Mac for multiple years.

Realityck said:
Mac II were used by consumers, not just businesses.

The Macintosh II was clearly positioned as the high end, though. It morphed into the Quadra, then the Power Macintosh, then the Mac Pro.

The II was the high-end alternative to the SE, and eventually the Classic and LC.

Consumer vs. Pro doesn't mean "the other group can't use". Lots of writers, who are professionals at their jobs, use the MacBook Air. But it's also very much a consumer product. Conversely, yes, I bought a MacBook Pro before I got paid full-time to use it. But that's an enthusiast thing and certainly not the norm.

The Mac Studio, especially once you add the Ultra to it, is much closer to high-end "pro" needs than it is to "consumer" needs.

chucker23n1 · Apr 2, 2022

sudo-sandwich said:
The base M1's single core is a little slower than an M1 Pro's. Very fast anyway.

I imagine that's mostly the memory bandwidth, which is tripled. The core itself runs at the same clock.

polyphenol · Apr 2, 2022

Piggie said:
Why is it not possible for the latest hardware to fool software, and the hardware itself spread the command sent to it from software across multiple cores, and then send back the result.

However amazing any SoC, there comes a point at which increasing cores, increasing clock speed, etc., hits the buffers - at least for the current time. If nothing else, cooling gets more and more difficult.

I come from the era of mainframes. When your workload was too great for the single processor, there were various options for dual, triple, quad processors. But from the systems I knew, all required the system software to manage the spread across processors, dynamically moving loads around. And keeping in sync with each other.

(Some could perform clever tricks like adding and removing processors from running configurations. Sometimes, you could even drop a processor, and start that processor up with a different operating system.)

Apple might find that the highest possible total workload needs something similar in concept. In Mac Studio terms, imagine putting two, three or four Ultra chips in one case. And having some fancy software, which could well be AI-based, pushing the workload around the SoCs.

In other words, enhancement beyond the best current SoC needs some fancy load management. Until the next even more powerful SoC comes out, of course!

AppleFan735 · Apr 2, 2022

Sf844 said:
It’s true that the new chips are great for rendering YouTube videos, mixing tracks and Photoshop. Youtubers and DJs say that is very professional and important work. Who am I to say otherwise?

On the other hand, you need an Intel processor to run a stable full Excel, Autocad, Solidworks, Siemens NX, Bloomberg Terminal and other overwhelming number of professional programs that only work on x86.

You know, those things that people who create and run finance, companies, the government, planes, rockets, ships, cars, buildings, iphones, m1 processors and other similar unimportant things use. So as far as I’m concerned, professionals use Intel.

What you named are all enterprise software. SJ himself admitted that Apple is near non-existent in enterprise world because, I will paraphrase but the gist was: "In enterprise the people that actually use the product are not the ones that make the purchasing decision, and people that do make these decisions are often confused." Apple decided it's a lot easier to convince the consumer to buy it's product directly (thus all the consumer-grade & creative software) instead of trying to convince the few decision makers in the enterprise to deploy their fleet of computers. Since Apple couldn't gain a foothold in the enterprise market (and basically gave up on the segment entirely) the only game in town was Microsoft which ended up dominating the enterprise world, and therefore you'll see everything enterprise-related is all x86 and Windows based. I myself have iPad, AirPods and iPhone but a Windows PC - because of my work.

chucker23n1 · Apr 2, 2022

polyphenol said:
I come from the era of mainframes. When your workload was too great for the single processor, there were various options for dual, triple, quad processors. But from the systems I knew, all required the system software to manage the spread across processors, dynamically moving loads around. And keeping in sync with each other.

The managing is no longer the hard problem, but keeping them in sync certainly is, and having neatly-defined self-contained tasks in the first place is.

polyphenol · Apr 2, 2022

chucker23n1 said:
The managing is no longer the hard problem, but keeping them in sync certainly is, and having neatly-defined self-contained tasks in the first place is.

One system I used would keep two real time clocks on different processors synced. But they did ensure that one processor used the odd numbers and the other the even numbers. (I think each clock actually updated every four microseconds, allowing for a quad processor regime.) And even if you read the RTC in two successive instructions, the system would delay the second read until the RTC had incremented. Thus, the same RTC time could never occur on two processors.

Oh yes, I agree - splitting the workload up is absolutely key.

chucker23n1 · Apr 2, 2022

BostonQuad said:
I can see why doubling most things would be useful (performance cores, GPU cores, neural engine cores, etc). But does doubling the secure enclave accomplish anything? If so, what?

Good question.

It's plausible to me that:

if you increase the storage, so you store more hashes (more fingerprints, more faces)
and if you then run all of those hashes in parallel

that it would help enable new use cases. For example, a multi-user iPad that can store the faces of the entire family (rather than of just one person), and can log you in to the correct account immediately.

macOS kind of has this, with Touch ID, where it doesn't just recognize a fingerprint but also the user it is associated with.

To do this, you'd need more storage and also potentially more processing power, and you could totally put each hash on its own core.

The_Joker13 · Apr 2, 2022

Sf844 said:
It’s true that the new chips are great for rendering YouTube videos, mixing tracks and Photoshop. Youtubers and DJs say that is very professional and important work. Who am I to say otherwise?

On the other hand, you need an Intel processor to run a stable full Excel, Autocad, Solidworks, Siemens NX, Bloomberg Terminal and other overwhelming number of professional programs that only work on x86.

You know, those things that people who create and run finance, companies, the government, planes, rockets, ships, cars, buildings, iphones, m1 processors and other similar unimportant things use. So as far as I’m concerned, professionals use Intel.

Interesting, both me and my business partner who run our solar business use macs exclusively. He handles all the finances, which includes using excel, among other things. While I handle all the engineering and design work. As far as I’m concerned, you could not be more wrong.

chucker23n1 · Apr 2, 2022

Sf844 said:
It’s true that the new chips are great for rendering YouTube videos, mixing tracks and Photoshop. Youtubers and DJs say that is very professional and important work. Who am I to say otherwise?

On the other hand, you need an Intel processor to run a stable full Excel, Autocad, Solidworks, Siemens NX, Bloomberg Terminal and other overwhelming number of professional programs that only work on x86.

You know, those things that people who create and run finance, companies, the government, planes, rockets, ships, cars, buildings, iphones, m1 processors and other similar unimportant things use. So as far as I’m concerned, professionals use Intel.

You're right that some niche software continues to be Windows- and x86-only. The market trend is certainly shifting away from that, though. Even if it weren't, your conclusion is silly. Tons of "professionals" don't rely on the software you listed. A lot of software development these days happens on macOS and Linux. A lot of graphic design has always happened on macOS. Etc.

But even if we take your examples:

you say "full Excel". That's technically true, but the Mac version isn't really that limited, and I would wager a lot of people who live in Excel full-time use it just fine. (It's also a historical footnote at this point, but fun fact: Excel originated on the Mac.)
Autodesk does have multiple products for the Mac.
I'm not sure why you bring up Solidworks as another CAD vendor.
…or Siemens NX as a third CAD app. I guess you like CAD? You realize some careers™ with professionals™ don't involve CAD, right?
Bloomberg Terminal is often just used remotely these days.

falkon-engine · Apr 2, 2022

Piggie said:
I need to ask what some may consider a very stupid question, even though to me it's a valid question to ask, but my total lack of knowledge about the way software talks to hardware makes me feel it's a good question to ask.

As I understand it, you can buy the lowest spec cheapest M1 Apple Product, and the most expensive $4000+ M1 Ultra product, and for "single core" applications, of which there are loads and loads, they will both perform exactly the same, which I'm sure there are many who even now don't realise this fact.

So my question is this:

Why is it not possible for the latest hardware to fool software, and the hardware itself spread the command sent to it from software across multiple cores, and then send back the result.

A bit like me saying I can give my order to one waitress (as I can only speak to one person at a time)
But without me knowing, she then tells a team of five people to make my order, once complete, they give the order back to her who gives it to me.
So, to me (the software?) I think it's all done by her, but in reality, it's been spread across multiple people without me being aware of it.

Can anyone explain in simple? terms why this is not done?

Thanks.

Multi threading may sound simple but is hard because you have to ensure the code is properly synchronized. Because of how modern operating systems are designed, various programs and threads run for a few microseconds and then they are interrupted by the thread scheduler for another thread to run. The cpu is a limited resource and the thousands of threads need a timeslice of time to run. Your program has no real control as to when it is interrupted.

When your code is multithreaded and say writing to an array, what if it is interrupted by the OS while writing to the array, and then some other thread accesses that same memory location and writes to it before the first thread finished its task?

So you have to synchronize certain memory locations to ensure proper concurrent access and this slows the code down some. Multi threading is a beast and hard to find bugs because the bugs don’t always manifest themselves in the same way each time the program is run. These could be called race conditions. There’s also the issue of deadlocks when one thread is waiting on another other thread, which itself is waiting on the first…so the program locks up and doesn’t continue to do any work.

More cores doesn’t always equal more performance. There is a point of diminishing returns. It depends on the task and how parallelizable it is or isn’t. Sometimes one thread is best. Other times the programmer has to think extremely hard on how to split different tasks off into separate threads and then let the main thread manage the concurrent access problem. So rather than trying to debug a nightmare multithreaded program some developers just use one or a few threads to get the job done, but then leave most of the 20 cores under-utilized.

Da_Hood · Apr 2, 2022

polyphenol said:
However amazing any SoC, there comes a point at which increasing cores, increasing clock speed, etc., hits the buffers - at least for the current time. If nothing else, cooling gets more and more difficult.

I come from the era of mainframes. When your workload was too great for the single processor, there were various options for dual, triple, quad processors. But from the systems I knew, all required the system software to manage the spread across processors, dynamically moving loads around. And keeping in sync with each other.

(Some could perform clever tricks like adding and removing processors from running configurations. Sometimes, you could even drop a processor, and start that processor up with a different operating system.)

Apple might find that the highest possible total workload needs something similar in concept. In Mac Studio terms, imagine putting two, three or four Ultra chips in one case. And having some fancy software, which could well be AI-based, pushing the workload around the SoCs.

In other words, enhancement beyond the best current SoC needs some fancy load management. Until the next even more powerful SoC comes out, of course!

In other words bring back Grand Central Dispatch with neural engine support.

dlastmango · Apr 2, 2022

I’m not sure if I’m missing something obvious, but is there a clock speed difference between all of the M chips? Are the speeds listed anywhere? Or are they all the same?

chucker23n1 · Apr 2, 2022

Da_Hood said:
In other words bring back Grand Central Dispatch with neural engine support.

GCD wasn't some magical framework to parallelize code; it was just tooling to make scheduling easier. These days, better mechanisms such as async-await exist, but the general problem of concurrency and parallelism continues to be hard.

dlastmango said:
I’m not sure if I’m missing something obvious, but is there a clock speed difference between all of the M chips? Are the speeds listed anywhere? Or are they all the same?

There is not. All M1 variants run their Firestorm cores at 3.2 GHz, unless they're briefly boosting, or thermally throttling.

However, the M1 Pro, Max, and Ultra have higher memory bandwidth: the M1 had roughly 67 GiB/s, the M1 Pro thrice that, and the Ultra twelve times that. In practice, the M1 Max's 400 GiB/s is only really used by the GPU (and perhaps Neural Engine); the CPU maxes out at about 224 GiB/s even if you use all cores, and just over 102 GiB/s with just one core.

TL;DR: the M1 Pro, Max and Ultra will be slightly faster on memory-heavy tasks. But other than that, yep, the cores are the same.

The A14, which also has Firestorm cores, instead ran the cores at 3.0 GHz. So it'll be a bit slower. It's also more thermally constrained.

Freeangel1 · Apr 2, 2022

hagjohn · Apr 2, 2022

cloudphrenia said:
hmmm, hopefully the M2 family add HDMI 2.1. and, I'd like to see what the performance would be like if power efficiency didn't matter. I'm willing to pay a little more on my electric bill for a high end workstation.

Forget HDMI all together and put in an extra TB/USB 4.0 port and use that. That way you get an extra useable port, instead of a port you may never use. I'm just not sure why people want to waste a port on HDMI.

chucker23n1 · Apr 2, 2022

hagjohn said:
Forget HDMI all together and put in an extra TB/USB 4.0 port and use that. That way you get an extra useable port, instead of a port you may never use. I'm just not sure why people want to waste a port on HDMI.

To connect a display that exists in the real world, not fantasyland.

Apple's M1 Ultra Chip: Everything You Need to Know

macrumors G3

macrumors member

macrumors 68000

macrumors 6502

macrumors 68000

Suspended

macrumors regular

Suspended

macrumors G3

macrumors G3

macrumors G3

macrumors 68020

macrumors regular

macrumors G3

macrumors 68020

macrumors G3

macrumors regular

macrumors G3

macrumors 65816

macrumors regular

macrumors 6502

macrumors G3

Suspended

macrumors 68000

macrumors G3

Our Staff