I wonder what's holding Apple back from releasing the M1 to the PC market. It would be a real kick in the teeth for both Intel and AMD.
The same reason they don't license out macOS? If, say, Dell were going to buy M1 chips for their laptops, they would certainly want macOS on them. No way they buy the M1 and hope to run Windows on ARM.

So yeah, Apple isn't going to sell the M1 on the open market, because that would force them to compete directly with all the other laptop makers for the macOS market.
 
Nope. The problem has always been technology: manufacturing.
For decades there was one constant:

Intel's manufacturing prowess was the undisputed king, at least two generations ahead of everyone else.

Right now, Intel's best is two generations behind TSMC's 5nm.

The easiest way to show you are wrong is to point out that RISC trounced CISC in performance and performance per watt in the 1990s, and yet got nowhere. Exponential's x704, DEC Alpha, etc. There was no meaningful manufacturing disadvantage from 1992 to 1997.
 
The 1990s were a golden age of awesome RISC machines: purple SGI MIPS boxes, DEC Alphas, pizza-box Sun SPARC boxes, boring ol' IBM RS/6000s, HP PA-RISC, ...
Still have a Gecko (PA-RISC) that runs OPENSTEP.
A lot of CPU architects had a lot of fun back then :)
Yup. Unfortunately, none could ever generate the volume to fund ongoing development.
 
Not only will they deny it, they will claim it's fraud. Intel fanboys are not going to just go away quietly. Remember, Amiga fanboys are still in denial that their "superior" platform died a horrible death at the hands of Apple.

The big test will come when the M1’s successor goes into the iMac, the iMac Pro, and finally the Mac Pro.
Regardless of whether x86 fanboys concede or stay in denial, the reality of the benchmark results and what people actually choose will determine the winner... their complaining and dubious, gaslighting, fact-free counter-claims don't mean anything...

You know, kinda like elections.

The article is a nice read, though some explanations may be a bit "off" (speaking as someone who studied processor design)... but good enough for general consumption...

Also note the areas where he explains things like OoO execution, where you can see Apple has good runway for more individual core performance (as opposed to Intel/AMD, which have hit a complexity wall)...
 
A 25W CPU beats a 10 or 15W CPU in multi-core and still loses in single-core. That's not impressive.
we’re seeing 18W active power, going up to around 22W in average workloads, and peaking around 27W in compute heavy workloads

This was quoted from AnandTech. These figures are from the M1 Mac mini. It is not 10W, more like 20-25W.
 
It is a good article, even explaining (as some of us here have been doing for weeks) how Apple Silicon is not the same as other ARM. AS does some things that are atypical in the ARM world, like cracking instructions into micro-ops and the out-of-order execution that flows from that, plus its massive eight decoders and a huge reorder buffer. AnandTech (as is their norm) got under the hood like this a while ago too and noted Apple's "humongous" architecture.
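As a toy sketch of the reorder-buffer idea mentioned above: instructions may finish executing out of order, but results are only retired (made architecturally visible) in program order from the head of the buffer. The class, buffer size, and instruction names below are illustrative assumptions, not Apple's actual design.

```python
# Toy model of a reorder buffer (ROB): instructions may *complete* out of
# order, but results are *retired* in program order. Purely illustrative.
from collections import deque

class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()          # entries kept in program order

    def issue(self, name):
        """Allocate an entry at issue time; returns False (stall) when full."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Mark an instruction as executed (this may happen out of order)."""
        for e in self.entries:
            if e["name"] == name:
                e["done"] = True

    def retire(self):
        """Retire only from the head, so architectural state updates in order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer(size=4)
for insn in ["load", "add", "mul", "store"]:
    rob.issue(insn)

rob.complete("add")          # finishes early...
print(rob.retire())          # ...but [] -- the older "load" hasn't completed yet
rob.complete("load")
print(rob.retire())          # ['load', 'add'] -- now both retire in order
```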
 
we’re seeing 18W active power, going up to around 22W in average workloads, and peaking around 27W in compute heavy workloads

This was quoted from AnandTech. These figures are from the M1 Mac mini. It is not 10W, more like 20-25W.
From that same article:
It’s to be noted with a huge disclaimer that because we are measuring AC wall power here, the power figures aren’t directly comparable to those of battery-powered devices, as the Mac mini’s power supply will incur an efficiency loss greater than that of other mobile SoCs, as well as the TDP figures contemporary vendors such as Intel or AMD publish.

It’s especially important to keep in mind that the figure of what we usually recall as TDP in processors is actually only a subset of the figures presented here, as beyond just the SoC we’re also measuring DRAM and voltage regulation overhead, something which is not included in TDP figures nor your typical package power readout on a laptop.

Meaning the M1 Mac mini’s total power draw was 27 watts at peak. It's a bit hard to get a like-for-like TDP comparison to the AMD parts. However, this will all be moot in 3 to 6 months when the next set of (higher-performance) Apple Silicon machines comes out and we can measure them against the competition again. Once there is a full range, we will know.
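For anyone who wants to play with the arithmetic, here is a rough back-of-the-envelope sketch of turning the 27 W wall figure quoted above into an SoC-only estimate. The PSU efficiency and DRAM/VRM overhead numbers are assumptions for illustration, not measured values.

```python
# Rough arithmetic for backing an SoC power estimate out of a wall-power
# reading. The 27 W peak is AnandTech's Mac mini wall figure quoted above;
# the efficiency and overhead values below are assumptions, not measurements.

wall_power_w = 27.0          # measured at the AC outlet (from the article)
psu_efficiency = 0.90        # assumed small-PSU efficiency at this load
dram_vrm_overhead_w = 3.0    # assumed DRAM + voltage-regulation overhead

dc_power_w = wall_power_w * psu_efficiency
soc_estimate_w = dc_power_w - dram_vrm_overhead_w

print(f"DC power after PSU loss: {dc_power_w:.1f} W")      # ~24.3 W
print(f"Rough SoC-only estimate: {soc_estimate_w:.1f} W")   # ~21.3 W
```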
 
"...Intel and AMD are in a tough spot because of the limitations of the CISC instruction set..."

Lord knows how many times I've heard this in previous RISC versus CISC discussions.
 
"...Intel and AMD are in a tough spot because of the limitations of the CISC instruction set..."

Lord knows how many times I've heard this in previous RISC versus CISC discussions.

And it’s always been true.

You need a friggen’ STATE MACHINE to decode x86 instructions. It takes multiple pipe stages to decode. You miss a branch prediction and you’re hosed. The instruction decoders on the x86-64 machines I worked on were bigger than the integer units. Just nuts. (Not to mention the giant load/store units necessary to cope with all the crazy x86 addressing modes).
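To make the decode point concrete, here is a toy sketch of why fixed-length instructions are trivial to decode in parallel while variable-length ones force a serial scan for instruction boundaries. The byte streams and the "first byte encodes the length" rule are made-up stand-ins; real x86 length decoding (prefixes, ModRM/SIB bytes, and so on) is far messier.

```python
# Toy illustration of fixed-length vs variable-length instruction decode.
# The encodings here are invented for illustration only.

FIXED_LEN = 4  # AArch64-style: every instruction is 4 bytes

def fixed_length_boundaries(code, n):
    """Instruction start offsets are known immediately: i * 4, no scanning."""
    return [i * FIXED_LEN for i in range(n)]

def variable_length_boundaries(code):
    """Boundaries must be discovered one instruction at a time (serially)."""
    offsets, pc = [], 0
    while pc < len(code):
        offsets.append(pc)
        length = code[pc]          # stand-in for real x86 length decoding
        pc += length
    return offsets

risc_code = bytes(16)                                  # four 4-byte instructions
cisc_code = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0])   # lengths 1, 3, 2, 5

print(fixed_length_boundaries(risc_code, 4))   # [0, 4, 8, 12] -- no scan needed
print(variable_length_boundaries(cisc_code))   # [0, 1, 4, 6]  -- found serially
```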
 
Engheim believes that Intel and AMD are in a tough spot because of the limitations of the CISC instruction set and their business models that don't make it easy to create end-to-end chip solutions for PC manufacturers.

Engheim's full article is well worth reading for those who are interested in how the M1 works and the technology that Apple has adopted to take a giant leap forward in computing performance.

Article Link: Developer Delves Into Reasons Why Apple's M1 Chip is So Fast

He clearly has no eff'n idea about modern CPU architecture and RISC vs CISC.
ARM hasn't adhered to a pure RISC architecture in years. It's heavily pipelined, has branch prediction and speculative execution, and not all instructions are a single size, for starters.

The unified memory architecture works fine since the memory is on chip; it does not scale when you need more memory.
 
And it’s always been true.

Yeah, but what do you know? It is not like you have designed both RISC and CISC machines, including the x86_64 parts that all the anti-ARM people love! :cool:

You need a friggen’ STATE MACHINE to decode x86 instructions. It takes multiple pipe stages to decode. You miss a branch prediction and you’re hosed.

Yeah, but once AMD and Intel move to the 2nm process and import their branch prediction from the future, they will never miss. So there! /s

The instruction decoders on the x86-64 machines I worked on were bigger than the integer units. Just nuts. (Not to mention the giant load/store units necessary to cope with all the crazy x86 addressing modes).

This is one of the points that just makes clear how much headroom Apple has over the AMD/Intel architecture. All those extra transistors can be used for more valuable things.
 
The article is wrong in that “simultaneously” doesn’t really mean “simultaneously,” but anyway... (they probably can read simultaneously, but not write; it depends on how many read and write ports the RAM has)

There is an L2 cache, and L1 caches, though. Not clear how these are synchronized with the GPU. Typically each CPU core has its own L1, and they share the L2. This means if an L1 is dirty (because someone wrote to it), and if the change is not reflected in the L2 or in main memory, then the GPU wouldn’t see the change. There are various ways around this (including “write-through” caches that always write all the way through to main memory or at least to L2, buses that advertise what addresses are dirty, etc.). Not much is known about what Apple is doing here.

I've used multi-port memories, and the only limitation is that you cannot simultaneously write to the same address, but you can write to different addresses. I've used 4+ port memory that had fully independent reads and writes.

As for the L2 cache, it's not a big deal to keep it coherent. Depending on how they treat the GPU, it can be either another "CPU" adhering to the same protocols or a coherent peripheral. Both are possible.
As far as keeping everything updated, it depends on their directory structure and snoop filter.
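As a rough illustration of the write-back vs write-through point raised above, here is a toy cache model showing why a dirty L1 line can hide a CPU write from another observer (say, a GPU reading main memory) until the line is flushed. It models the write policy only; real coherency protocols (MESI, snoop filters, directories) are far more involved, and we don't know what Apple actually does here.

```python
# Toy sketch of write-back vs write-through caching. Policy only; no real
# coherency protocol is modeled.

class Memory:
    def __init__(self):
        self.data = {}
    def read(self, addr):
        return self.data.get(addr, 0)
    def write(self, addr, value):
        self.data[addr] = value

class L1Cache:
    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}                              # addr -> (value, dirty)

    def write(self, addr, value):
        if self.write_through:
            self.memory.write(addr, value)           # visible to memory immediately
            self.lines[addr] = (value, False)
        else:
            self.lines[addr] = (value, True)         # dirty: memory is now stale

    def flush(self):
        """Write back all dirty lines so other observers see the updates."""
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.memory.write(addr, value)
        self.lines = {a: (v, False) for a, (v, _) in self.lines.items()}

ram = Memory()
l1 = L1Cache(ram, write_through=False)
l1.write(0x100, 42)
print(ram.read(0x100))   # 0  -- a GPU reading RAM now would miss the CPU's update
l1.flush()
print(ram.read(0x100))   # 42 -- visible only after the dirty line is written back
```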
 
And it’s always been true.

You need a friggen’ STATE MACHINE to decode x86 instructions. It takes multiple pipe stages to decode. You miss a branch prediction and you’re hosed. The instruction decoders on the x86-64 machines I worked on were bigger than the integer units. Just nuts. (Not to mention the giant load/store units necessary to cope with all the crazy x86 addressing modes).
Perhaps, but Apple chose x86 over PPC back in 2006. The performance benefit of the M1 has less to do with CISC versus RISC than with Apple placing everything in one package, including specialized processing units, made possible by a 5nm process technology.

IMO the benefit of RISC is overrated. Decode logic is a small part of a CPU die these days.
 
Perhaps, but Apple chose x86 over PPC back in 2006. The performance benefit of the M1 has less to do with CISC versus RISC than with Apple placing everything in one package, including specialized processing units, made possible by a 5nm process technology.

IMO the benefit of RISC is overrated. Decode logic is a small part of a CPU die these days.

That's because Intel was executing and Motorola was not, and it had nothing to do with the pros or cons of RISC.

And, nothing personal, but unless you are a CPU designer or architect, your opinion on the overratedness of RISC is, itself, overrated.
 
I've used multi-port memories, and the only limitation is that you cannot simultaneously write to the same address, but you can write to different addresses. I've used 4+ port memory that had fully independent reads and writes.

As for the L2 cache, it's not a big deal to keep it coherent. Depending on how they treat the GPU, it can be either another "CPU" adhering to the same protocols or a coherent peripheral. Both are possible.
As far as keeping everything updated, it depends on their directory structure and snoop filter.

Sure. These are all solvable, and my only point is that we don’t know what Apple is doing, and that the article that used the word “simultaneously” probably didn’t really mean that.

My other (poorly phrased) point was simply that it’s unusual for the GPU to participate in coherency protocols, and it’s not clear to me exactly how they did that. The memory latency demands for a GPU are different from those for a CPU (a GPU needs something closer to “real time”), so I was simply also expressing that I don’t know what Apple has chosen to do to solve that issue.
 
That's because Intel was executing and Motorola was not, and it had nothing to do with the pros or cons of RISC.

And, nothing personal, but unless you are a CPU designer or architect, your opinion on the overratedness of RISC is, itself, overrated.
I love how people on this site always qualify their insults with "nothing personal" or some such nonsense.

I'm not going to try and convince you of anything as it's obvious your mind is closed.
 
The problem with RISC has never been technology. Lots of RISC processors in the past have blown away their CISC competitors.

The problem has always been “does it seamlessly run Windows and existing Windows apps?”

BYOD, the mobile Arm hegemony, and Apple’s expertise at supporting multi-architecture code have finally broken the glass.

My understanding was that CISC was chosen because it's faster: it can run multiple operations at one time, while RISC could run just one operation at a time. RISC was used mainly on devices that were more of an appliance, largely due to their low power usage. I mean, even on the *nix side of things they mainly use AMD and Intel, which are CISC; no Windows needed there. It doesn't help that Apple abandoned IBM PowerPC (RISC) and switched architectures to Intel (CISC) just because PowerPC was not delivering as much performance.
 
I love how people on this site always qualify their insults with "nothing personal" or some such nonsense.

I'm not going to try and convince you of anything as it's obvious your mind is closed.

My mind is closed based on having a PhD in electrical engineering and having designed many CPUs, including PowerPC, SPARC, x86, x86-64, and MIPS, yes.
 


Apple's M1 chip is the fastest chip that Apple has ever released in a Mac based on single-core CPU benchmark scores, and it beats out many high-end Intel Macs when it comes to multi-core performance. Developer Erik Engheim recently shared a deep dive into the M1 chip, exploring the reasons why Apple's new processor is so much faster than the Intel chips that it replaces.


First and foremost, the M1 isn't a simple CPU. As Apple has explained, it's a System-on-a-Chip, which is a series of chips that are all housed together in one silicon package. The M1 houses an 8-core CPU, 8-core GPU (7-core in some MacBook Air models), unified memory, SSD controller, image signal processor, Secure Enclave, and tons more.

Intel and AMD also ship multiple microprocessors in a single package, but as Engheim describes, Apple has a leg up because rather than focusing on general purpose CPU cores like its competitors, Apple is focusing on specialized chips that handle specialized tasks.

In addition to the CPU (with high-performance and high-efficiency cores) and GPU, the M1 has a Neural Engine for machine learning tasks like voice recognition and camera processing, a built-in video decoder/encoder for power-efficient conversion of video files, the Secure Enclave to handle encryption, the Digital Signal Processor for handling mathematically intensive functions like decompressing music files, and the Image Processing Unit that speeds up tasks done by image processing apps.

Notably, there's also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.
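As a conceptual sketch of the copy-vs-share distinction described above, the snippet below uses a shared memoryview as a stand-in for CPU and GPU views of the same physical memory. This is only an analogy for the zero-copy idea; it is not how the M1 or its software stack actually exposes memory.

```python
# Conceptual sketch: copying data to a separate pool vs sharing one pool.
# A memoryview stands in for "CPU and GPU see the same bytes"; purely an analogy.
import time

N = 50_000_000
cpu_buffer = bytearray(N)

# Discrete-GPU style: the working set must be copied into a separate buffer.
start = time.perf_counter()
gpu_copy = bytes(cpu_buffer)                 # full copy of the data
copy_ms = (time.perf_counter() - start) * 1000

# Unified-memory style: both "sides" reference the same underlying bytes.
start = time.perf_counter()
gpu_view = memoryview(cpu_buffer)            # no data movement at all
share_ms = (time.perf_counter() - start) * 1000

cpu_buffer[0] = 7
print(gpu_copy[0])                           # 0 -- the copied buffer is stale
print(gpu_view[0])                           # 7 -- the shared view sees the write
print(f"copy: {copy_ms:.1f} ms, share: {share_ms:.3f} ms")
```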

All of these chips with specific purposes speed up specific tasks, leading to the improvements that people are seeing. Specialized chips have been in use for years, but Apple is taking a "more radical shift towards this direction," as Engheim describes. Other chip makers like AMD are taking a similar approach, but Intel and AMD rely on selling general-purpose CPUs, and for licensing reasons, PC manufacturers like Dell and HP are likely not able to design a full SoC in house the way Apple can.

Apple is able to integrate hardware and software in a way that's just not possible for most other companies to replicate, which is something that's always given the iPhone and iPad an edge over other smartphones and tablets. Along with the benefits of an in-house designed System-on-a-Chip, Apple is also using Firestorm CPU cores in the M1 that are "genuinely fast" and able to execute more instructions in parallel through out-of-order execution, the RISC architecture, and some specific optimizations Apple has implemented, which Engheim explains in depth.

Engheim believes that Intel and AMD are in a tough spot because of the limitations of the CISC instruction set and their business models that don't make it easy to create end-to-end chip solutions for PC manufacturers.

Engheim's full article is well worth reading for those who are interested in how the M1 works and the technology that Apple has adopted to take a giant leap forward in computing performance.

Article Link: Developer Delves Into Reasons Why Apple's M1 Chip is So Fast
I'm wondering if future Apple Silicon will be able to beat the multi-core benchmarks of the 28-core Xeon Mac Pro.
 