I wouldn't take anything from a site like that at face value. It could all be made up; how could you know?

But really I just wanted to mention that Armv9 does not imply SVE; it's still an optional extension. When Armv9 comes to Apple's CPU cores, they could choose not to use it, just as they did with the M1 and M2. (The SVE extension was already available when the M1 was in development.)
Yes, SVE was originally developed with Fujitsu for the A64FX, the chip behind the Fugaku supercomputer. But ARMv9 would be the perfect time for Apple to implement SVE.
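For anyone wondering what SVE would add over NEON, here is a rough sketch using the ACLE intrinsics (purely illustrative and not Apple-specific): the loop is vector-length agnostic, so the same binary would use whatever vector width a given core implements.

```cpp
// Vector-length-agnostic add: works unchanged on 128-bit or 512-bit SVE
// hardware. Build with something like: clang -O2 -march=armv8-a+sve
#include <arm_sve.h>
#include <cstdint>

void add_arrays(float* dst, const float* a, const float* b, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {   // svcntw() = fp32 lanes per vector
        svbool_t pg = svwhilelt_b32(i, n);                  // predicate masks the tail
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```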
 
I think the remarks are that Cinebench is heavily optimized for Intel x86-64 and not so much for Apple silicon. I'm not sure why you think those remarks are controversial?

They're controversial because they don't disguise the fact that most software is optimized (or indeed, even available) for x86 and only x86.
 
They're controversial because they don't disguise the fact that most software is optimized (or indeed, even available) for x86 and only x86.
False. A lot of software has already been compiled for ARM. Cinebench is a terrible benchmark whether it's optimized for ARM or not. AMD and Intel continue to tout Cinebench in marketing despite the fact that it does a terrible job of measuring general CPU performance. It's good at one thing and one thing only: measuring the performance of Cinema 4D, which is niche software in a niche market.

Cinebench is the reason why Intel is throwing 16 small cores into their CPUs alongside only 8 high-performance cores. Almost no software can make use of 16 small cores, but Intel added them because they want to win in Cinebench, which favors many slow cores.

Cinebench is counterproductive for the CPU industry. We are getting CPUs that are optimized for Cinebench rather than for the applications people actually use.
 
Cinebench is as good a benchmark as any other.

They all simply test a predetermined scenario. Complain about platform optimisation all you like, but if your intended use case is rendering with Maxon software, then Cinebench is an accurate indicator of what performance to expect.

Benchmarks are only of any use if your usage matches the testing methodology.
 
They all simply test a predetermined scenario. Complain about platform optimisation all you like, but if your intended use case is rendering with Maxon software, then Cinebench is an accurate indicator of what performance to expect.

This is a common misconception. Cinema 4D users hardly use the CPU renderer; most seem to rely on the GPU-accelerated Redshift. You could maybe use CB23 as a proxy for Blender CPU render times, but that has a similar problem.

The biggest issue is that CB23 is assumed to represent the “general performance” of a CPU, and that simply doesn’t make any sense. It’s a benchmark of running dependent computation chains using Intel AVX SIMD extensions; that’s it.
 
The biggest issue is that CB23 is assumed to represent the “general performance” of a CPU, and that simply doesn’t make any sense.
Who claims that? Isn't Cinebench R23 a good predictor of multiprocessing (at least on PC CPUs)?
 
Who claims that? Isn't Cinebench R23 a good predictor of multiprocessing (at least on PC CPUs)?

Define "multiprocessing". Cinebench, due to the nature of the code, scales very well with SMT and multiple cores on x86. This is not necessarily representative of most multi-threaded workloads.
 
Define "multiprocessing". Cinebench, due to the nature of the code, scales very well with SMT and multiple cores on x86. This is not necessarily representative of most multi-threaded workloads.
Multi-processing = multi-core/multi-threaded

Even if other multi-threaded workloads don't scale as well as R23, wouldn't there be a linear relationship between them? Could an x86 CPU outperform another CPU in R23 but run slower with a real multi-threaded workload?
Isn't there a linear relationship between multi-core GB5 and Cinebench R23 results (at least on x86 CPUs)?
 
10 performance cores?


This seems plausible. The M2 base is also 1398 MHz. As for memory, the M1 Max uses LPDDR5 and the M2 Max supposedly does too. However, if it's the same clock speed, I'm worried it's not the ray-tracing GPU from the A16. They probably just copied the clock speed from the M2 because they couldn't measure it.

Fabrication process: 5 nm
Not sure this is accurate.



I'd be disappointed. Where's ARMv9 with SVE? Although to be fair, they'd probably use the Sawtooth cores from the A16.
This is showing 5nm because Apple first tested an M2 Max built on TSMC's 5nm technology. Now that the 3nm chips are going into mass production, Apple might test a 3nm chip again.
 
This is showing 5nm because Apple first tested an M2 Max built on TSMC's 5nm technology. Now that the 3nm chips are going into mass production, Apple might test a 3nm chip again.
My sources tell me this M2 Max will be for the MacBook Pro 14" and 16", and maybe the Mac Studio.
My sources also tell me that Apple tested 4 prototypes: 5nm, then 4nm, now 3nm and 2nm, so everything is possible, but I'm sure it will be 5nm or 4nm or 3nm or 2nm.
My sources say that the 14"/16" will come in silver and space black, but prototypes also show shades of gold, green, pink, yellow, etc.
So everything is possible.
My sources tell me that the MBP will keep its prices, but Apple is thinking about lowering the price or raising the price, so everything is possible.
 
This is a common misconception. Cinema 4D users hardly use the CPU renderer; most seem to rely on the GPU-accelerated Redshift. You could maybe use CB23 as a proxy for Blender CPU render times, but that has a similar problem.

I have been out of that loop for a while. Is the workflow these days pre-viz with Redshift and final render on CPU?
 
Who claims that? Isn't Cinebench R23 a good predictor of multiprocessing (at least on PC CPUs)?
Only for a very narrow subset of what "multiprocessing" covers.

When trying to use multiple cores, there's two important patterns in software. One is called "embarrassingly parallel", and the other is basically everything else.

Embarrassingly parallel problems are those where the workload can easily be split up between an arbitrary number of cores, and there are no dependencies between the worker threads running on each core.

The harder problem is accelerating algorithms which have dependencies between threads. This arises when multiple threads need to edit the same data structure (it's very important that they don't both try to mess with the same thing at the same time), when one thread needs an interim result from another, and so forth. Thread synchronization points are both difficult to get right and a source of scaling problems.

If the entire software world consisted of nothing but embarrassingly parallel throughput algorithms, that would be great! CPU manufacturers could pack big chips full of hundreds of efficiency cores, and these would provide far more compute throughput than today's CPUs. (The colloquial name for such a chip is a "flock-of-chickens" design. Do you want to pull your plow with a flock of chickens, or a big strong ox?)

Unfortunately, in the real world, embarrassingly parallel is not the only thing. It's still very important to provide high single thread performance. There's lots of software which either doesn't scale at all, or scales poorly beyond three or four cores, so you want at least some of the cores in the system to be as fast at ST as is practical.

Another factor is that the rise of GPGPU allows us to move lots of embarrassingly parallel work off to GPUs, which actually are flock-of-chickens designs. The availability of GPU compute is an argument for CPUs to remain things which are mostly optimized for serial computation. (This is even more true with unified memory SoC designs like Apple Silicon, where the penalty for moving data between CPU and GPU essentially doesn't exist.)

Cinebench is an embarrassingly parallel CPU benchmark. It would be perfectly happy with a flock-of-chickens CPU design. CPU raytracing algorithms usually cast backwards from the screen - you fire rays from pixel coordinates and trace bounces back until you find a light source. So you just divide the workload up into blocks of pixels and toss those work units at CPU cores. It scales extremely well, and the only thread-to-thread dependency is work assignment, which is one of the easiest multithreaded programming problems.

Cinebench even visualizes its workload splitting for you. It assigns each CPU a square block of pixels to render and draws a box around it. You should have as many boxes visible as you have hardware threads, and as each CPU core (or thread) finishes one box it gets assigned a new one.
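A minimal sketch of that pattern (illustrative only, not Cinebench's actual code): an atomic counter hands out tile indices, each worker renders its tiles independently, and that counter is the only point of thread-to-thread synchronization.

```cpp
// Tile-based work assignment for an embarrassingly parallel renderer.
// The atomic counter is the only shared state; everything else is independent.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kImageW = 1920, kImageH = 1080, kTile = 64;

// Stand-in for tracing every ray in one block of pixels.
void renderTile(int tileX, int tileY) { (void)tileX; (void)tileY; }

int main() {
    const int tilesX = (kImageW + kTile - 1) / kTile;
    const int tilesY = (kImageH + kTile - 1) / kTile;
    const int tileCount = tilesX * tilesY;

    std::atomic<int> nextTile{0};  // work-assignment counter
    auto worker = [&] {
        for (int t = nextTile.fetch_add(1); t < tileCount; t = nextTile.fetch_add(1))
            renderTile(t % tilesX, t / tilesX);  // no dependency on other threads
    };

    std::vector<std::thread> pool;
    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < nThreads; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("rendered %d tiles on %u threads\n", tileCount, nThreads);
}
```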
 
Cinebench even visualizes its workload splitting for you. It assigns each CPU a square block of pixels to render and draws a box around it. You should have as many boxes visible as you have hardware threads, and as each CPU core (or thread) finishes one box it gets assigned a new one.
Shouldn't we be using GPUs for raytracing? Even without hardware acceleration, it should be faster than CPU. And easier to work with. Metal has a great API you can experiment with on Apple silicon.
where the penalty for moving data between CPU and GPU essentially doesn't exist.
It exists all right and it's quite huge. 150-200 microseconds for a round-trip to the GPU with Metal. Meanwhile, with CUDA it's 20 microseconds. You could just touch one virtual memory page and the Nvidia GPU quickly transfers it through PCI-e.
 
My sources tell me this M2 Max will be for the MacBook Pro 14" and 16", and maybe the Mac Studio.
My sources also tell me that Apple tested 4 prototypes: 5nm, then 4nm, now 3nm and 2nm, so everything is possible, but I'm sure it will be 5nm or 4nm or 3nm or 2nm.
My sources say that the 14"/16" will come in silver and space black, but prototypes also show shades of gold, green, pink, yellow, etc.
So everything is possible.
My sources tell me that the MBP will keep its prices, but Apple is thinking about lowering the price or raising the price, so everything is possible.
How could Apple test 2nm when it hasn't even launched? 2nm won't launch until after 2024.
 
Shouldn't we be using GPUs for raytracing? Even without hardware acceleration, it should be faster than CPU. And easier to work with. Metal has a great API you can experiment with on Apple silicon.
Sure, but Cinebench is what it is. It uses an open source raytracing library written by Intel to make Intel CPUs look good. So it only uses the CPU, its SIMD acceleration consists of x86 intrinsics style code, and the only support for non-x86 SIMD is through emulation libraries which mechanically translate individual SSE intrinsics to equivalent sequences of native SIMD instructions.
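For illustration (modelled on translation headers such as sse2neon, not Embree's exact code): a simple SSE intrinsic maps one-to-one to NEON, but something like _mm_movemask_ps has no single NEON equivalent and expands into a short sequence, which is where performance gets left on the table.

```cpp
// What "mechanical translation" of SSE intrinsics to NEON roughly looks like.
// Compiles on AArch64; the function names here are my own, not from any library.
#include <arm_neon.h>
#include <cstdint>

// _mm_add_ps(a, b) -> a single NEON instruction, essentially free to translate.
static inline float32x4_t sse_add_ps(float32x4_t a, float32x4_t b) {
    return vaddq_f32(a, b);
}

// _mm_movemask_ps(a) -> gather the four sign bits into an int. NEON has no
// direct "movemask", so it becomes a shift, a per-lane shift, and a horizontal add.
static inline int sse_movemask_ps(float32x4_t a) {
    uint32x4_t sign = vshrq_n_u32(vreinterpretq_u32_f32(a), 31);   // sign bit -> bit 0
    const int32_t lane_shift[4] = {0, 1, 2, 3};
    uint32x4_t weighted = vshlq_u32(sign, vld1q_s32(lane_shift));  // bit of lane i << i
    return static_cast<int>(vaddvq_u32(weighted));                 // sum across lanes
}
```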

It exists all right and it's quite huge. 150-200 microseconds for a round-trip to the GPU with Metal. Meanwhile, with CUDA it's 20 microseconds. You could just touch one virtual memory page and the Nvidia GPU quickly transfers it through PCI-e.
This has to be an API thing. With unified memory, there's no need for a copy at all. There are probably some wrinkles, of course - the GPU has its own page table for example. So I can easily understand some RT latency based on API peculiarities, or not using the API in exactly the right way, or needing to adjust or synchronize page tables, etc.

Fundamentally, however, the underlying hardware should support dramatically less latency. PCIe has a lot of latency baked in - about 1 microsecond round trip for a single 32-bit read, in my experience. (It's really a networking protocol so this is not surprising.) For Apple Silicon, the round trip is through the on-chip SLC (system-level cache), which should be much faster.
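To make the "no copy needed" part concrete, here is a minimal sketch with metal-cpp (my assumption that C++ is convenient here; the Swift/Objective-C Metal API exposes the same storage mode): a shared-storage buffer is the same physical memory for the CPU and the GPU, so whatever round-trip latency remains is API and synchronization overhead rather than a data transfer.

```cpp
// Sketch: on Apple Silicon, a shared-storage buffer is visible to the CPU and
// GPU without any staging copy or PCIe transfer. Assumes the metal-cpp headers;
// exactly one translation unit must define the *_PRIVATE_IMPLEMENTATION macros.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>

int main() {
    MTL::Device* device = MTL::CreateSystemDefaultDevice();

    // Shared storage: CPU writes through this pointer are the GPU's data.
    MTL::Buffer* buf = device->newBuffer(4096, MTL::ResourceStorageModeShared);
    float* cpuView = static_cast<float*>(buf->contents());
    for (int i = 0; i < 1024; ++i) cpuView[i] = float(i);

    // ... encode GPU work that reads `buf` directly; no blit/copy is needed ...

    buf->release();
    device->release();
}
```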
 
One is called "embarrassingly parallel", and the other is basically everything else.
Can Cinebench predict the performance of non-embarrassingly parallel workloads?
Which benchmark is better at predicting the performance of non-embarrassingly parallel workloads?

I have created a Multi R23 vs Multi GB5 graph by randomly taking a very small sample of CPUs from the top of the list from the link.
[Chart: multi-core Cinebench R23 vs. multi-core Geekbench 5 scores]

The outlier is M1 Ultra.

If the M1 Ultra lay on the line that the x86 CPUs form (GB5 ≈ 0.53 × R23 + 3386.83, R² = 0.9378), its Geekbench 5 score of 23768 would correspond to about 38455 in Cinebench R23, almost 1.6 times the actual result (a quick fit and prediction are sketched after the table).

CPU | Cinebench R23 (multi) | Geekbench 5 (multi)
Intel Core i9 13900K | 39536 | 24268
AMD Ryzen 9 7950X | 38544 | 23729
AMD Ryzen 9 7900X | 29614 | 20799
Intel Core i9 12900K | 27368 | 17138
AMD Ryzen 9 5950X | 26237 | 18636
Apple M1 Ultra | 24035 | 23768
Intel Core i5 13600KF | 23819 | 15430
AMD Ryzen 9 3950X | 22923 | 14538
AMD Ryzen 7 7700X | 19873 | 14455
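As a quick sanity check of that fit, here is a sketch (my code) that runs the least-squares fit over the x86 rows above and then inverts it for the M1 Ultra's GB5 score; the coefficients and the roughly 38k prediction come out close to the figures quoted, with small differences due to rounding.

```cpp
// Ordinary least squares of GB5-multi against CB23-multi for the x86 CPUs in
// the table, then the fit is inverted to ask what CB23 score the x86 trend
// "expects" for a CPU with the M1 Ultra's GB5 result.
#include <cstdio>
#include <vector>

int main() {
    // {Cinebench R23 multi, Geekbench 5 multi}, x86 entries from the table.
    const std::vector<std::pair<double, double>> pts = {
        {39536, 24268}, {38544, 23729}, {29614, 20799}, {27368, 17138},
        {26237, 18636}, {23819, 15430}, {22923, 14538}, {19873, 14455},
    };
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    const double n = pts.size();
    for (const auto& [cb, gb] : pts) {
        sx += cb; sy += gb; sxx += cb * cb; syy += gb * gb; sxy += cb * gb;
    }
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double intercept = (sy - slope * sx) / n;
    const double r2 = (n * sxy - sx * sy) * (n * sxy - sx * sy) /
                      ((n * sxx - sx * sx) * (n * syy - sy * sy));
    std::printf("GB5 ~ %.4f * CB23 + %.2f   (R^2 = %.4f)\n", slope, intercept, r2);

    // Invert the fit for the M1 Ultra's GB5 score of 23768.
    const double expectedCB23 = (23768 - intercept) / slope;
    std::printf("CB23 'expected' from the x86 trend: ~%.0f (actual: 24035)\n",
                expectedCB23);
}
```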

only support for non-x86 SIMD is through emulation libraries which mechanically translate individual SSE intrinsics to equivalent sequences of native SIMD instructions.
Out of curiosity, how do you know? Was it written in Embree's release notes?

Is the poor optimization of the ARM SIMD instructions the only reason why the actual M1 Ultra result in Cinebench R23 falls so far below the theoretical result? Could the ARM SIMD instructions be worse than the x86 SIMD instructions?
 
Blindly mapping x86 SIMD functions to NEON functions will leave a lot of performance on the table. It's better than nothing, but it's usually possible to get much better performance with hand-written optimisations.
 
Can Cinebench predict the performance of non-embarrassingly parallel workloads?
Which benchmark is better at predicting the performance of non-embarrassingly parallel workloads?

I have created a Multi R23 vs Multi GB5 graph by randomly taking a very small sample of CPUs from the top of the list from the link.
The outlier is M1 Ultra.

If the M1 Ultra lay on the line that the x86 CPUs form (GB5 ≈ 0.53 × R23 + 3386.83, R² = 0.9378), its Geekbench 5 score of 23768 would correspond to about 38455 in Cinebench R23, almost 1.6 times the actual result.

CPU | Cinebench R23 (multi) | Geekbench 5 (multi)
Intel Core i9 13900K | 39536 | 24268
AMD Ryzen 9 7950X | 38544 | 23729
AMD Ryzen 9 7900X | 29614 | 20799
Intel Core i9 12900K | 27368 | 17138
AMD Ryzen 9 5950X | 26237 | 18636
Apple M1 Ultra | 24035 | 23768
Intel Core i5 13600KF | 23819 | 15430
AMD Ryzen 9 3950X | 22923 | 14538
AMD Ryzen 7 7700X | 19873 | 14455

I didn’t do the math, but I’m quite confident that the best predictor of CB23 is the clock speed times number of cores. All these CPUs have the same SIMD capabilities, so in the end it’s all about clock frequency.


Out of curiosity, how do you know? Was it written in Embree's release notes?

Embree is open source. It’s right there in the code. You are welcome to have a look.

Is the poor optimization of the ARM SIMD instructions the only reason why the actual M1 Ultra result in Cinebench R23 falls so far below the theoretical result?
Could the ARM SIMD instructions be worse than the x86 SIMD instructions?

If you develop and optimize for a particular platform, it makes sense that this platform will have the best performance. Translating some common x86 SIMD idioms to ARM SIMD will result in a performance loss, and vice versa.

But again, clock frequency plays an important role. Current Apple Silicon models have no chance of winning a raw SIMD punching contest against x86. The SIMD throughput per clock is the same, but desktop x86 parts are clocked significantly higher.
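To put rough numbers on that (my back-of-the-envelope assumptions about pipe counts and clocks, not measured figures): four 128-bit FMA pipes and two 256-bit FMA pipes both work out to about 32 fp32 FLOPs per cycle, so the gap in peak SIMD throughput is essentially the clock gap.

```cpp
// Peak fp32 SIMD throughput per core = FLOPs/cycle * clock. Assumed figures:
// 4 x 128-bit FMA pipes and 2 x 256-bit FMA pipes are both 16 lanes * 2 ops = 32 FLOP/cycle.
#include <cstdio>

int main() {
    struct Core { const char* name; double flopsPerCycle; double clockGHz; };
    const Core cores[] = {
        {"Apple P-core (4 x 128-bit FMA, ~3.2 GHz)", 32.0, 3.2},
        {"x86 P-core   (2 x 256-bit FMA, ~5.5 GHz)", 32.0, 5.5},
    };
    for (const Core& c : cores)
        std::printf("%-42s ~%.0f GFLOP/s peak fp32\n", c.name, c.flopsPerCycle * c.clockGHz);
}
```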
 
I didn’t do the math, but I’m quite confident that the best predictor of CB23 is the clock speed times number of cores. All these CPUs have the same SIMD capabilities, so in the end it’s all about clock frequency.
If the main difference in performance were those reasons and not how much it is optimized for x86 over ARM, the benchmark would be legit because those are implementation choices. It would be very unfair to praise an ARM CPU for being efficient because it can perform as well as an x86 at lower frequencies and to discredit a benchmark because it penalizes CPUs using lower frequencies.

What would the performance of a CPU using a hybrid architecture be?
(clock speed of P cores × number of P cores) + (clock speed of E cores × number of E cores)?
 
It would be very unfair to praise an ARM CPU for being efficient because it can perform as well as an x86 at lower frequencies and to discredit a benchmark because it penalizes CPUs using lower frequencies.
How could that be unfair? The ARM CPU should definitely be praised and the benchmark discredited in this scenario.
 
Sure, but Cinebench is what it is. It uses an open source raytracing library written by Intel to make Intel CPUs look good. So it only uses the CPU, its SIMD acceleration consists of x86 intrinsics style code, and the only support for non-x86 SIMD is through emulation libraries which mechanically translate individual SSE intrinsics to equivalent sequences of native SIMD instructions.


This has to be an API thing. With unified memory, there's no need for a copy at all. There are probably some wrinkles, of course - the GPU has its own page table for example. So I can easily understand some RT latency based on API peculiarities, or not using the API in exactly the right way, or needing to adjust or synchronize page tables, etc.

Fundamentally, however, the underlying hardware should support dramatically less latency. PCIe has a lot of latency baked in - about 1 microsecond round trip for a single 32-bit read, in my experience. (It's really a networking protocol so this is not surprising.) For Apple Silicon, the round trip is through the on-chip SLC (system-level cache), which should be much faster.
Thanks.

How do you see Apple going forward to catch up and/or innovate in such scenarios?

What would a hypothetical Mac Pro silicon look like that matches or exceeds x86 + GPGPU accelerators communicating over PCIe (CXL is supposed to mitigate the current GPU-to-CPU communication bottlenecks)?

I am just trying to understand where Apple is going with its M-series chips (e.g., will it always be the case that Apple lags behind in pure raw throughput because they need perf per watt on their better-selling Macs at the cost of desktop offerings, or does the SoC approach have its own disadvantages:
- a ceiling on how high you can clock them, or
- an inability to scale well beyond two SoCs, like the rumoured cancellation of the M-series Extreme chips)?
 
How could that be unfair? The ARM CPU should definitely be praised and the benchmark discredited in this scenario.
Do you want to discredit a benchmark because it scales linearly with frequency?

Frequency is an implementation choice, with pros and cons. It seems unfair to me to choose benchmarks that praise the pros and discredit those that penalize the cons.
 
If the main difference in performance were those reasons and not how much it is optimized for x86 over ARM, the benchmark would be legit because those are implementation choices.

The problem with CB23 is not that it’s “unfair”. The problem is that it’s not a representative measure of general CPU performance.

If your primary usage is high-throughput SIMD, x86 is the undisputed king. There is simply no argument around it. You don’t need Cinebench to figure that out either. But those workflows are actually not that common. CPU raytracing, chess engines, some forms of HPC computing. That’s about it. If you only judge the performance of Apple Silicon by its CB23 score you are massively underestimating what it can do in other domains.



How do you see Apple going forward to catch up and/or innovate in such scenarios?

I know you asked @mr_roboto but I hope it’s fine with you if I also offer my 50 cents.

First, a quick observation. Apple SIMD CPU units in the current hardware are designed for flexibility. They have four independent 128-bit units in contrast to fewer but wider units used by x86 CPUs. This also explains why M1 performs exceedingly well in many flavors of scientific software which is less reliant on vectorization.

Now, to your question. I think there is evidence that Apple recognizes different types of workloads and designs separate solutions for them. I believe that their primary SIMD units will remain optimized for flexibility of use and low latency at the expense of throughput. For high-throughput, HPC-like work they will continue to improve the AMX coprocessor, which already surpasses the capabilities of x86 machines today. Note that modern ARM architectures make this kind of split official by offering regular and “streaming” vector processing modes (with the streaming mode operating very similarly to Apple's AMX). And finally, for raytracing, Apple's solution is the GPU. They already offer a state-of-the-art raytracing GPU API with very generous data limits, and there is evidence that their next generation of GPUs will offer raytracing acceleration.

I believe this approach is very promising, but Apple needs to continue improving the software side. Their matrix hardware is currently hidden from the programmer, and they need to give us direct access. Also, they need to improve the GPU programming model by offering virtual memory sharing between the CPU and the GPU as well as better data synchronization. Ideally one would have a programming model where different processors can communicate with very low latency (true for the CPU and AMX today, but not for the CPU and GPU).
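As an aside on the "hidden from the programmer" point: the only supported route to that matrix hardware today is indirect, through Accelerate. A plain CBLAS call like the sketch below (my example) is reportedly dispatched to the AMX-style units by Apple's BLAS, but there are no public intrinsics for it.

```cpp
// Plain CBLAS through Accelerate; nothing AMX-specific appears in the source.
// Build with: clang++ -O2 gemm.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 512;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);

    // C = 1.0 * A * B + 0.0 * C, single-precision, row-major GEMM.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A.data(), n,
                B.data(), n,
                0.0f, C.data(), n);

    std::printf("C[0] = %.1f\n", C[0]);  // 512 * (1.0 * 2.0) = 1024.0
}
```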
 
The problem with CB23 is not that it’s “unfair”. The problem is that it’s not a representative measure of general CPU performance.
Wouldn't other benchmarks have the same problem? Which benchmark would be better at predicting multithreaded performance?
 