TigeRick (original poster)
Apple has finally upgraded the SoC inside the Ultra 2 to the S9 SiP. We still don't know much about this SoC, so I am creating this thread to collect technical information and speed figures for current and future Apple S-series chips. I am still not sure about the amount of RAM in the S8 and S9; I remember seeing some tweets mentioning 1.5GB of RAM. Can anyone confirm?

The first benchmark result comes from Speedometer 2.1; so far it shows a +38% performance improvement on web applications. Not bad: the speed is approaching that of the iPhone 6s with its A9 SoC. Is there any other benchmarking app? If you can point me to one, I will update the table accordingly, thanks :)

FYI, Samsung's Galaxy Watch5 with 2xA55 @ 1.18GHz scored 7 runs/min 😂

The latest SoC from Qualcomm, the Snapdragon W5+ with 4xA53 @ 1.7GHz, scored 11 runs/min. That is better than Samsung; this SoC is rumored to be used in the upcoming Pixel Watch 2.

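|                             | Ultra               | Ultra 2            | Ultra 3? |
|-----------------------------|---------------------|--------------------|----------|
| Year Announced              | 2022                | 2023               | 2024     |
| SoC                         | S8 1.8GHz Dual Core | S9 ? Dual Core     |          |
| Process Node                | N7P                 | N5                 |          |
| Transistors                 | 3.5 billion         | 5.6 billion        |          |
| Architecture                | A11's E-core        | A15's E-core       |          |
| GPU                         | 1 core - 64 ALU?    | 1 core - 256 ALU?  |          |
| RAM                         | 1.5 GB LPDDR4X      | ? GB LPDDR4X       |          |
| Neural Engine               | NA                  | 4-core             |          |
| Storage                     | 32GB                | 64GB               |          |
| Speedometer 2.1 (Runs/min)  | 27.0                | 37.1               |          |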
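For anyone who wants to redo the comparison, here is a minimal Python sketch of the arithmetic. The scores are the ones quoted in this post; the dictionary labels and function names are just illustrative. From the rounded table values the S9 gain works out to roughly 37%, which is consistent with the approximate +38% quoted above.

```python
# Speedometer 2.1 scores (runs/min) as quoted in this post.
scores = {
    "S8 (Ultra)": 27.0,
    "S9 (Ultra 2)": 37.1,
    "Galaxy Watch5": 7.0,
    "Snapdragon W5+": 11.0,
}


def speedup(new: float, old: float) -> float:
    """How many times faster `new` is than `old`."""
    return new / old


def percent_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


s9 = scores["S9 (Ultra 2)"]
for name, score in scores.items():
    if name == "S9 (Ultra 2)":
        continue
    # Print both the ratio and the percentage gain of the S9 over each chip.
    print(f"S9 vs {name}: {speedup(s9, score):.2f}x ({percent_gain(s9, score):+.0f}%)")
```

Keep in mind that these are single browser benchmark runs, so small differences should not be over-read.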


Speedometer 2.0.png
 
How are these benchmarks taken? I searched the App Store for “speedometer”, but there does not appear to be an iOS app, let alone a watchOS app.
How can you measure this without an app?
Can you please provide some background info?
 
Just open the link in the browser; no app is required.
 
How/where do you get the GPU numbers?
I agree that one core is the obvious value, and the GPUs (before A17) were, like recent nVidia designs, basically organized as follows:
- a single core shares one L1 (instruction, data, and scratchpad)
- the core is divided into four quadrants
- each quadrant has 32 lanes (and an L0 instruction cache, though Apple's is rather different from nVidia's)

This means that for the most obvious (and correctly crafted...) sort of benchmark (something like how many FMAs the core can do per cycle), a core should look like it can do 4 x 32 = 128 FMAs per cycle.
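A rough sketch of how such a benchmark result would be read, in Python. The quadrant and lane counts come from the description above; the function names, the measured rate, and the clock are hypothetical placeholders, not real measurements.

```python
def expected_fma_per_cycle(quadrants: int = 4, lanes_per_quadrant: int = 32) -> int:
    """FMAs per cycle for one core if every lane retires one FMA per cycle."""
    return quadrants * lanes_per_quadrant


def inferred_fma_per_cycle(measured_fma_per_second: float, clock_hz: float) -> float:
    """Back out FMAs per cycle from a measured throughput and an assumed clock."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return measured_fma_per_second / clock_hz


# Hypothetical numbers, purely for illustration.
hypothetical_rate = 128e9   # FMAs per second reported by a microbenchmark
hypothetical_clock = 1.0e9  # assumed GPU clock in Hz

print("expected per core:", expected_fma_per_cycle())
print("inferred from measurement:", inferred_fma_per_cycle(hypothetical_rate, hypothetical_clock))
```

Note that the inferred value is only as good as the assumed clock: if the clock is wrong by a factor of two, the apparent lane count is wrong by the same factor.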

BUT there are multiple ways things can become more interesting. I won't go into all of these, except to point out that Apple (nVidia does things very differently, but also in very interesting ways) has in the past (starting with M1) put both a 32-bit FMA unit and a 16-bit FMA unit in each lane. This uses more area, but saves power if, much of the time, you are doing genuine FP16 work and don't have to run it on FP32 hardware.
Now the obvious question is: if you have this duplicated hardware, can you get "double" use out of it? There are at least two possibilities.

One is that you add some (not much required) hardware to the FP32 execution unit so that it can operate on a SIMD2 pair of FP16's. This requires that you occasionally have SIMD2 work of that sort, but that's actually pretty common in graphics. If you do this, it will now (for some benchmarks) look like you have 256 rather than 128 FP units in a core.

Another alternative is to effectively provide two instructions per cycle in each quadrant AND decide to (at least in some situations) also execute FP16 instructions in the FP32 unit. Then a pair of instructions could execute two FP16 operations in parallel, one in the FP16 unit and one in the FP32 unit.
There are multiple ways to provide two instructions per cycle; the trick is doing so at an acceptably low energy level. You could do superscalar statically scheduled code, or a form of VLIW, or even play timing games (nVidia's favorite way of handling this) so that, one way or another, the instruction pipeline operates at twice the frequency of the execution pipeline.

The point is, it's very interesting if some benchmarks are showing a supposed 256 ALUs per core, and then we would like to know many more details! Isn't this the same as the A15/M2 GPU core (as far as I know, very definitely 128 ALUs, with no weird tricks like those I have described)? Or did they slide in the A17 GPU (seems a strange decision, but who knows)? What EXACTLY is the benchmark that's suggesting 256 ALUs measuring?
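To make the two possibilities concrete, here is a toy Python model of how a 128-lane core could look like 256 ALUs to an FP16 benchmark. It is only a sketch of the reasoning in this post: the class name and flags are made up for illustration, and it does not describe confirmed Apple hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreModel:
    """Toy model of one GPU core: 4 quadrants x 32 lanes, as described above."""
    quadrants: int = 4
    lanes_per_quadrant: int = 32
    simd2_fp16_on_fp32_unit: bool = False  # possibility 1: FP32 unit handles a packed FP16 pair
    dual_issue_fp16: bool = False          # possibility 2: FP16 and FP32 units both run FP16 ops

    @property
    def lanes(self) -> int:
        return self.quadrants * self.lanes_per_quadrant

    def fp32_fma_per_cycle(self) -> int:
        # One FP32 FMA per lane per cycle in every scenario modelled here.
        return self.lanes

    def fp16_fma_per_cycle(self) -> int:
        # Either trick lets each lane complete two FP16 FMAs per cycle.
        if self.simd2_fp16_on_fp32_unit or self.dual_issue_fp16:
            return 2 * self.lanes
        return self.lanes


for label, core in [
    ("baseline", CoreModel()),
    ("SIMD2 FP16 on FP32 unit", CoreModel(simd2_fp16_on_fp32_unit=True)),
    ("dual-issue FP16", CoreModel(dual_issue_fp16=True)),
]:
    print(f"{label}: FP32 {core.fp32_fma_per_cycle()}, FP16 {core.fp16_fma_per_cycle()} FMAs/cycle")
```

In this model only the FP16 figure doubles, which is why it matters exactly what the benchmark behind a "256 ALU" number is measuring.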
 
Well, it is from my Excel database of previously collected data. 🤣

I cannot recall where I got the information, but you could search for the A15's GPU; it seems the A15 is the only A-series chip with such a high ALU count (1280), which at 600MHz nets 1.5 TFLOPS. And that's the reason why the A15 iPhones have the longest battery life: the GPU runs at a low clock speed.
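The arithmetic behind that TFLOPS figure can be sketched in Python as follows. It uses the usual convention that one FMA counts as 2 floating-point operations; the function name is illustrative, and the second configuration (640 ALUs at 1.2GHz) is a hypothetical alternative added only to show that the same peak number does not pin down the ALU count.

```python
def peak_tflops(alus: int, clock_ghz: float, flops_per_alu_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOPS: ALUs x FLOPs per cycle x clock (GHz) / 1000."""
    return alus * flops_per_alu_per_cycle * clock_ghz / 1000.0


# Figures quoted in this post: 1280 ALUs at 600MHz.
claimed = peak_tflops(alus=1280, clock_ghz=0.6)

# Hypothetical alternative: half the ALUs at twice the clock.
alternative = peak_tflops(alus=640, clock_ghz=1.2)

print(f"1280 ALUs @ 0.6GHz: {claimed:.2f} TFLOPS")
print(f" 640 ALUs @ 1.2GHz: {alternative:.2f} TFLOPS")
```

Both configurations give the same theoretical peak, so a TFLOPS figure alone cannot tell them apart; the clock has to be known independently.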

Maybe @leman has some insights?
 
Not that it’s definitive, but the diagram Apple shows in their Sept. 12 event presentation (at 9:20) appears to have 4 GPU cores.
 
Of course who knows if the four rectangles in that presentation just correspond to four quadrants of a single core? It's not an outrageous level of artistic license. I mean 4 cores is basically an iPhone-level GPU, which seems insane!
 
Of course that assumes the 600MHz is trustworthy...

I tend to be EXTREMELY dubious of microbenchmarks run by people who have no idea how to interpret the results, or what is being tested. It's hard to screw up the meaning of a GB6 score, but it's VERY easy to take some code that you don't understand that's SUPPOSED to measure a specific feature of a core and have it fail because that feature now operates in a way whoever wrote the original microbenchmark did not expect. That's why so much of my first two PDFs is about drawing graphs, looking for anomalies, and trying to understand how results do (or don't) fit together.
 
Oh, I agree. I was a bit shocked and perplexed when I watched the event live. It’s also odd that they claimed it is only 30% faster if there are 4x the cores. OTOH, it could use much less power if designed to be clocked much lower while also being able to switch cores off. Either way, 256 ALUs is pretty remarkable no matter how it’s configured.
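As a back-of-the-envelope check on the lower-clock idea, here is a small Python sketch. It assumes, very naively, that GPU throughput scales as cores times clock; the 4x core count and the 30% figure are the ones mentioned in this post, and the function name is made up for illustration.

```python
def implied_clock_ratio(speedup: float, core_ratio: float) -> float:
    """Clock ratio (new/old) needed to explain `speedup` given `core_ratio`
    times as many cores, assuming throughput = cores x clock."""
    if core_ratio <= 0:
        raise ValueError("core_ratio must be positive")
    return speedup / core_ratio


ratio = implied_clock_ratio(speedup=1.30, core_ratio=4.0)
print(f"Implied clock: {ratio:.1%} of the previous GPU clock")
```

Under that assumption, four cores delivering only 30% more performance would have to run at about a third of the old clock. Real scaling is rarely this linear, so treat this as an illustration of the argument rather than an estimate.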

I forget if they mentioned the lithography node. Do we know if it’s produced using N3B? I would assume that it is, since it’s the first new watch SiP since S6.
 
An Apple GPU core has 128 “compute units” (4x 32-wide data-parallel SIMD processors). The A15 GPU runs at 1.33GHz, if I remember correctly.
 