I wouldn't take anything from a site like that at face value. It could all be made up; how could you know?

But really I just wanted to mention that Armv9 does not imply SVE; it's still an optional extension. When Armv9 comes to Apple's CPU cores, they could choose not to use it, just as they did with the M1 and M2. (The SVE extension was already available when the M1 was in development.)
Yes, SVE was originally developed with Fujitsu for the A64FX, the chip behind the Fugaku supercomputer. But ARMv9 would be the perfect time for Apple to implement SVE.
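For anyone wondering what SVE would add over NEON, here is a rough sketch using the ACLE intrinsics (purely illustrative and not Apple-specific): the loop is vector-length agnostic, so the same binary would use whatever vector width a given core implements.

```cpp
// Vector-length-agnostic add: works unchanged on 128-bit or 512-bit SVE
// hardware. Build with something like: clang -O2 -march=armv8-a+sve
#include <arm_sve.h>
#include <cstdint>

void add_arrays(float* dst, const float* a, const float* b, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {   // svcntw() = fp32 lanes per vector
        svbool_t pg = svwhilelt_b32(i, n);                  // predicate masks the tail
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```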
 
I think the remarks are that Cinebench is heavily optimized for Intel x86-64 and not so much for Apple silicon. I'm not sure why you think those remarks are controversial?

They're controversial because they don't disguise the fact that most software is optimized (or indeed, even available) for x86 and only x86.
 
They're controversial because they don't disguise the fact that most software is optimized (or indeed, even available) for x86 and only x86.
False. A lot of software has already been compiled for ARM. Cinebench is a terrible benchmark whether it's optimized for ARM or not. AMD and Intel continue to tout Cinebench in marketing despite the fact that it does a terrible job of measuring general CPU performance. It's good at one thing and one thing only: measuring the performance of Cinema 4D, which is niche software in a niche market.

Cinebench is the reason why Intel is throwing 16 small cores into their CPUs alongside only 8 high-performance cores. Almost no software can make use of 16 small cores, but Intel added them because they want to win in Cinebench, which favors many slow cores.

Cinebench is counterproductive for the CPU industry. We are getting CPUs that are optimized for Cinebench rather than for the applications people actually use.
 
Cinebench is as good a benchmark as any other.

They all simply test a predetermined scenario. Complain about platform optimisation all you like, but if your intended use case is rendering with Maxon software, then Cinebench is an accurate indicator of what performance to expect.

Benchmarks are only of any use if your usage matches the testing methodology.
 
They all simply test a predetermined scenario. Complain about platform optimisation all you like, but if your intended use case is rendering with Maxon software, then Cinebench is an accurate indicator of what performance to expect.

This is a common misconception. Cinema 4D users hardly use the CPU renderer; most seem to rely on the GPU-accelerated Redshift. You could maybe use CB23 as a proxy for Blender CPU render times, but that has a similar problem.

The biggest issue is that CB23 is assumed to represent the “general performance” of a CPU, and that simply doesn’t make any sense. It’s a benchmark of running dependent computation chains using Intel AVX SIMD extensions; that’s it.
 
The biggest issue is that CB23 is assumed to represent the “general performance” of a CPU, and that simply doesn’t make any sense.
Who claims that? Isn't Cinebench R23 a good predictor of multiprocessing (at least on PC CPUs)?
 
Who claims that? Isn't Cinebench R23 a good predictor of multiprocessing (at least on PC CPUs)?

Define "multiprocessing". Cinebench, due to the nature of the code, scales very well with SMT and multiple cores on x86. This is not necessarily representative of most multi-threaded workloads.
 
Define "multiprocessing". Cinebench, due to the nature of the code, scales very well with SMT and multiple cores on x86. This is not necessarily representative of most multi-threaded workloads.
Multi-processing = multi-core/multi-threaded

Even if other multi-threaded workloads don't scale as well as R23, wouldn't there be a linear relationship between them? Could an x86 CPU outperform another CPU in R23 but run slower with a real multi-threaded workload?
Isn't there a linear relationship between multi-core GB5 and Cinebench R23 results (at least on x86 CPUs)?
 
10 performance cores?


This seems plausible. The M2 base is also 1398 MHz. As for memory, the M1 Max uses LPDDR5 and the M2 Max supposedly does too. However, if it's the same clock speed, I'm worried it's not the ray-tracing GPU from the A16. They probably just copied the clock speed from the M2 because they couldn't measure it.

Fabrication process: 5 nm
Not sure this is accurate.



I'd be disappointed. Where's ARMv9 with SVE? Although to be fair, they'd probably use the Sawtooth cores from the A16.
This is showing 5nm because Apple first tested an M2 Max built on TSMC's 5nm technology. Now that the 3nm chips are going into mass production, Apple might test a 3nm chip again.
 
This is showing 5nm because Apple first tested an M2 Max built on TSMC's 5nm technology. Now that the 3nm chips are going into mass production, Apple might test a 3nm chip again.
My sources tell me this M2 Max will be for the MacBook Pro 14" and 16", and maybe the Mac Studio.
My sources also tell me that Apple tested 4 prototypes: 5nm, then 4nm, now 3nm and 2nm, so everything is possible, but I'm sure it will be 5nm or 4nm or 3nm or 2nm.
My sources say that the 14"/16" will come in silver and space black, but prototypes also show shades of gold, green, pink, yellow, etc.
So everything is possible.
My sources tell me that the MBP will keep its prices, but Apple is thinking about lowering the price or raising the price, so everything is possible.
 
This is a common misconception. Cinema 4D users hardly use the CPU renderer; most seem to rely on the GPU-accelerated Redshift. You could maybe use CB23 as a proxy for Blender CPU render times, but that has a similar problem.

I have been out of that loop for a while. Is the workflow these days pre-viz with Redshift and final render on CPU?
 
Who claims that? Isn't Cinebench R23 a good predictor of multiprocessing (at least on PC CPUs)?
Only for a very narrow subset of what "multiprocessing" covers.

When trying to use multiple cores, there's two important patterns in software. One is called "embarrassingly parallel", and the other is basically everything else.

Embarrassingly parallel problems are those where the workload can easily be split up between an arbitrary number of cores, and there are no dependencies between the worker threads running on each core.

The harder problem is accelerating algorithms which have dependencies between threads. This arises when multiple threads need to edit the same data structure (it's very important that they don't both try to mess with the same thing at the same time), when one thread needs an interim result from another, and so forth. Thread synchronization points are both difficult to get right and a source of scaling problems.

If the entire software world consisted of nothing but embarrassingly parallel throughput algorithms, that would be great! CPU manufacturers could pack big chips full of hundreds of efficiency cores, and these would provide far more compute throughput than today's CPUs. (The colloquial name for such a chip is a "flock-of-chickens" design. Do you want to pull your plow with a flock of chickens, or a big strong ox?)

Unfortunately, in the real world, embarrassingly parallel is not the only thing. It's still very important to provide high single thread performance. There's lots of software which either doesn't scale at all, or scales poorly beyond three or four cores, so you want at least some of the cores in the system to be as fast at ST as is practical.

Another factor is that the rise of GPGPU allows us to move lots of embarrassingly parallel work off to GPUs, which actually are flock-of-chickens designs. The availability of GPU compute is an argument for CPUs to remain things which are mostly optimized for serial computation. (This is even more true with unified memory SoC designs like Apple Silicon, where the penalty for moving data between CPU and GPU essentially doesn't exist.)

Cinebench is an embarrassingly parallel CPU benchmark. It would be perfectly happy with a flock-of-chickens CPU design. CPU raytracing algorithms usually cast backwards from the screen - you fire rays from pixel coordinates and trace bounces back until you find a light source. So you just divide the workload up into blocks of pixels and toss those work units at CPU cores. It scales extremely well, and the only thread-to-thread dependency is work assignment, which is one of the easiest multithreaded programming problems.

Cinebench even visualizes its workload splitting for you. It assigns each CPU a square block of pixels to render and draws a box around it. You should have as many boxes visible as you have hardware threads, and as each CPU core (or thread) finishes one box it gets assigned a new one.
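A minimal sketch of that pattern (illustrative only, not Cinebench's actual code): an atomic counter hands out tile indices, each worker renders its tiles independently, and that counter is the only point of thread-to-thread synchronization.

```cpp
// Tile-based work assignment for an embarrassingly parallel renderer.
// The atomic counter is the only shared state; everything else is independent.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kImageW = 1920, kImageH = 1080, kTile = 64;

// Stand-in for tracing every ray in one block of pixels.
void renderTile(int tileX, int tileY) { (void)tileX; (void)tileY; }

int main() {
    const int tilesX = (kImageW + kTile - 1) / kTile;
    const int tilesY = (kImageH + kTile - 1) / kTile;
    const int tileCount = tilesX * tilesY;

    std::atomic<int> nextTile{0};  // work-assignment counter
    auto worker = [&] {
        for (int t = nextTile.fetch_add(1); t < tileCount; t = nextTile.fetch_add(1))
            renderTile(t % tilesX, t / tilesX);  // no dependency on other threads
    };

    std::vector<std::thread> pool;
    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < nThreads; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("rendered %d tiles on %u threads\n", tileCount, nThreads);
}
```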
 
Cinebench even visualizes its workload splitting for you. It assigns each CPU a square block of pixels to render and draws a box around it. You should have as many boxes visible as you have hardware threads, and as each CPU core (or thread) finishes one box it gets assigned a new one.
Shouldn't we be using GPUs for raytracing? Even without hardware acceleration, it should be faster than CPU. And easier to work with. Metal has a great API you can experiment with on Apple silicon.
where the penalty for moving data between CPU and GPU essentially doesn't exist.
It exists all right and it's quite huge. 150-200 microseconds for a round-trip to the GPU with Metal. Meanwhile, with CUDA it's 20 microseconds. You could just touch one virtual memory page and the Nvidia GPU quickly transfers it through PCI-e.
 
My sources tell me this M2 Max will be for the MacBook Pro 14" and 16", and maybe the Mac Studio.
My sources also tell me that Apple tested 4 prototypes: 5nm, then 4nm, now 3nm and 2nm, so everything is possible, but I'm sure it will be 5nm or 4nm or 3nm or 2nm.
My sources say that the 14"/16" will come in silver and space black, but prototypes also show shades of gold, green, pink, yellow, etc.
So everything is possible.
My sources tell me that the MBP will keep its prices, but Apple is thinking about lowering the price or raising the price, so everything is possible.
How could Apple test 2nm when it hasn't even launched? 2nm won't launch until after 2024.
 
Shouldn't we be using GPUs for raytracing? Even without hardware acceleration, it should be faster than CPU. And easier to work with. Metal has a great API you can experiment with on Apple silicon.
Sure, but Cinebench is what it is. It uses an open source raytracing library written by Intel to make Intel CPUs look good. So it only uses the CPU, its SIMD acceleration consists of x86 intrinsics style code, and the only support for non-x86 SIMD is through emulation libraries which mechanically translate individual SSE intrinsics to equivalent sequences of native SIMD instructions.
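For illustration (modelled on translation headers such as sse2neon, not Embree's exact code): a simple SSE intrinsic maps one-to-one to NEON, but something like _mm_movemask_ps has no single NEON equivalent and expands into a short sequence, which is where performance gets left on the table.

```cpp
// What "mechanical translation" of SSE intrinsics to NEON roughly looks like.
// Compiles on AArch64; the function names here are my own, not from any library.
#include <arm_neon.h>
#include <cstdint>

// _mm_add_ps(a, b) -> a single NEON instruction, essentially free to translate.
static inline float32x4_t sse_add_ps(float32x4_t a, float32x4_t b) {
    return vaddq_f32(a, b);
}

// _mm_movemask_ps(a) -> gather the four sign bits into an int. NEON has no
// direct "movemask", so it becomes a shift, a per-lane shift, and a horizontal add.
static inline int sse_movemask_ps(float32x4_t a) {
    uint32x4_t sign = vshrq_n_u32(vreinterpretq_u32_f32(a), 31);   // sign bit -> bit 0
    const int32_t lane_shift[4] = {0, 1, 2, 3};
    uint32x4_t weighted = vshlq_u32(sign, vld1q_s32(lane_shift));  // bit of lane i << i
    return static_cast<int>(vaddvq_u32(weighted));                 // sum across lanes
}
```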

It exists all right and it's quite huge. 150-200 microseconds for a round-trip to the GPU with Metal. Meanwhile, with CUDA it's 20 microseconds. You could just touch one virtual memory page and the Nvidia GPU quickly transfers it through PCI-e.
This has to be an API thing. With unified memory, there's no need for a copy at all. There are probably some wrinkles, of course - the GPU has its own page table for example. So I can easily understand some RT latency based on API peculiarities, or not using the API in exactly the right way, or needing to adjust or synchronize page tables, etc.

Fundamentally, however, the underlying hardware should support dramatically less latency. PCIe has a lot of latency baked in - about 1 microsecond round trip for a single 32-bit read, in my experience. (It's really a networking protocol so this is not surprising.) For Apple Silicon, the round trip is through the on-chip SLC (system-level cache), which should be much faster.
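To make the "no copy needed" part concrete, here is a minimal sketch with metal-cpp (my assumption that C++ is convenient here; the Swift/Objective-C Metal API exposes the same storage mode): a shared-storage buffer is the same physical memory for the CPU and the GPU, so whatever round-trip latency remains is API and synchronization overhead rather than a data transfer.

```cpp
// Sketch: on Apple Silicon, a shared-storage buffer is visible to the CPU and
// GPU without any staging copy or PCIe transfer. Assumes the metal-cpp headers;
// exactly one translation unit must define the *_PRIVATE_IMPLEMENTATION macros.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>

int main() {
    MTL::Device* device = MTL::CreateSystemDefaultDevice();

    // Shared storage: CPU writes through this pointer are the GPU's data.
    MTL::Buffer* buf = device->newBuffer(4096, MTL::ResourceStorageModeShared);
    float* cpuView = static_cast<float*>(buf->contents());
    for (int i = 0; i < 1024; ++i) cpuView[i] = float(i);

    // ... encode GPU work that reads `buf` directly; no blit/copy is needed ...

    buf->release();
    device->release();
}
```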
 
One is called "embarrassingly parallel", and the other is basically everything else.
Can Cinebench predict the performance of non-embarrassingly parallel workloads?
Which benchmark is better at predicting the performance of non-embarrassingly parallel workloads?

I have created a Multi R23 vs Multi GB5 graph by randomly taking a very small sample of CPUs from the top of the list from the link.
[Chart: multi-core Cinebench R23 vs. multi-core Geekbench 5 scores]

The outlier is M1 Ultra.

If the M1 Ultra lay on the line that the x86 CPUs form (GB5 ≈ 0.53 × R23 + 3386.83, R² = 0.9378), its Geekbench 5 score of 23768 would correspond to about 38455 in Cinebench R23, almost 1.6 times the actual result (a quick fit and prediction are sketched after the table).

CPU | Cinebench R23 (multi) | Geekbench 5 (multi)
Intel Core i9 13900K | 39536 | 24268
AMD Ryzen 9 7950X | 38544 | 23729
AMD Ryzen 9 7900X | 29614 | 20799
Intel Core i9 12900K | 27368 | 17138
AMD Ryzen 9 5950X | 26237 | 18636
Apple M1 Ultra | 24035 | 23768
Intel Core i5 13600KF | 23819 | 15430
AMD Ryzen 9 3950X | 22923 | 14538
AMD Ryzen 7 7700X | 19873 | 14455
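As a quick sanity check of that fit, here is a sketch (my code) that runs the least-squares fit over the x86 rows above and then inverts it for the M1 Ultra's GB5 score; the coefficients and the roughly 38k prediction come out close to the figures quoted, with small differences due to rounding.

```cpp
// Ordinary least squares of GB5-multi against CB23-multi for the x86 CPUs in
// the table, then the fit is inverted to ask what CB23 score the x86 trend
// "expects" for a CPU with the M1 Ultra's GB5 result.
#include <cstdio>
#include <vector>

int main() {
    // {Cinebench R23 multi, Geekbench 5 multi}, x86 entries from the table.
    const std::vector<std::pair<double, double>> pts = {
        {39536, 24268}, {38544, 23729}, {29614, 20799}, {27368, 17138},
        {26237, 18636}, {23819, 15430}, {22923, 14538}, {19873, 14455},
    };
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    const double n = pts.size();
    for (const auto& [cb, gb] : pts) {
        sx += cb; sy += gb; sxx += cb * cb; syy += gb * gb; sxy += cb * gb;
    }
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double intercept = (sy - slope * sx) / n;
    const double r2 = (n * sxy - sx * sy) * (n * sxy - sx * sy) /
                      ((n * sxx - sx * sx) * (n * syy - sy * sy));
    std::printf("GB5 ~ %.4f * CB23 + %.2f   (R^2 = %.4f)\n", slope, intercept, r2);

    // Invert the fit for the M1 Ultra's GB5 score of 23768.
    const double expectedCB23 = (23768 - intercept) / slope;
    std::printf("CB23 'expected' from the x86 trend: ~%.0f (actual: 24035)\n",
                expectedCB23);
}
```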

only support for non-x86 SIMD is through emulation libraries which mechanically translate individual SSE intrinsics to equivalent sequences of native SIMD instructions.
Out of curiosity, how do you know? Was it written in Embree's release notes?

Is the poor optimization of the ARM SIMD instructions the only reason why the actual M1 Ultra result in Cinebench R23 falls so far below the theoretical result? Could the ARM SIMD instructions be worse than the x86 SIMD instructions?
 
Blindly mapping x86 SIMD functions to NEON functions will leave a lot of performance on the table. It's better than nothing, but it's usually possible to get much better performance with hand-written optimisations.
 
Can Cinebench predict the performance of non-embarrassingly parallel workloads?
Which benchmark is better at predicting the performance of non-embarrassingly parallel workloads?

I have created a Multi R23 vs Multi GB5 graph by randomly taking a very small sample of CPUs from the top of the list from the link.
The outlier is M1 Ultra.

If the M1 Ultra lay on the line that the x86 CPUs form (GB5 ≈ 0.53 × R23 + 3386.83, R² = 0.9378), its Geekbench 5 score of 23768 would correspond to about 38455 in Cinebench R23, almost 1.6 times the actual result.

CPU | Cinebench R23 (multi) | Geekbench 5 (multi)
Intel Core i9 13900K | 39536 | 24268
AMD Ryzen 9 7950X | 38544 | 23729
AMD Ryzen 9 7900X | 29614 | 20799
Intel Core i9 12900K | 27368 | 17138
AMD Ryzen 9 5950X | 26237 | 18636
Apple M1 Ultra | 24035 | 23768
Intel Core i5 13600KF | 23819 | 15430
AMD Ryzen 9 3950X | 22923 | 14538
AMD Ryzen 7 7700X | 19873 | 14455

I didn’t do the math, but I’m quite confident that the best predictor of CB23 is the clock speed times number of cores. All these CPUs have the same SIMD capabilities, so in the end it’s all about clock frequency.


Out of curiosity, how do you know? Was it written in Embree's release notes?

Embree is open source. It’s right there in the code. You are welcome to have a look.

Is the poor optimization of the ARM SIMD instructions the only reason why the actual M1 Ultra result in Cinebench R23 falls so far below the theoretical result?
Could the ARM SIMD instructions be worse than the x86 SIMD instructions?

If you develop and optimize for a particular platform, it makes sense that this platform will have the best performance. Translating some common x86 SIMD idioms to ARM SIMD will result in a performance loss, and vice versa.

But again, clock frequency plays an important role. Current Apple Silicon models have no chance of winning a raw SIMD punching contest against x86. The SIMD throughput per clock is the same, but desktop x86 parts are clocked significantly higher.
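To put rough numbers on that (my back-of-the-envelope assumptions about pipe counts and clocks, not measured figures): four 128-bit FMA pipes and two 256-bit FMA pipes both work out to about 32 fp32 FLOPs per cycle, so the gap in peak SIMD throughput is essentially the clock gap.

```cpp
// Peak fp32 SIMD throughput per core = FLOPs/cycle * clock. Assumed figures:
// 4 x 128-bit FMA pipes and 2 x 256-bit FMA pipes are both 16 lanes * 2 ops = 32 FLOP/cycle.
#include <cstdio>

int main() {
    struct Core { const char* name; double flopsPerCycle; double clockGHz; };
    const Core cores[] = {
        {"Apple P-core (4 x 128-bit FMA, ~3.2 GHz)", 32.0, 3.2},
        {"x86 P-core   (2 x 256-bit FMA, ~5.5 GHz)", 32.0, 5.5},
    };
    for (const Core& c : cores)
        std::printf("%-42s ~%.0f GFLOP/s peak fp32\n", c.name, c.flopsPerCycle * c.clockGHz);
}
```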
 
I didn’t do the math, but I’m quite confident that the best predictor of CB23 is the clock speed times number of cores. All these CPUs have the same SIMD capabilities, so in the end it’s all about clock frequency.
If the main difference in performance were those reasons and not how much it is optimized for x86 over ARM, the benchmark would be legit because those are implementation choices. It would be very unfair to praise an ARM CPU for being efficient because it can perform as well as an x86 at lower frequencies and to discredit a benchmark because it penalizes CPUs using lower frequencies.

What would the performance of a CPU using a hybrid architecture be?
(clock speed of P cores × number of P cores) + (clock speed of E cores × number of E cores)?
 
It would be very unfair to praise an ARM CPU for being efficient because it can perform as well as an x86 at lower frequencies and to discredit a benchmark because it penalizes CPUs using lower frequencies.
How could that be unfair? The ARM CPU should definitely be praised and the benchmark discredited in this scenario.
 
Sure, but Cinebench is what it is. It uses an open source raytracing library written by Intel to make Intel CPUs look good. So it only uses the CPU, its SIMD acceleration consists of x86 intrinsics style code, and the only support for non-x86 SIMD is through emulation libraries which mechanically translate individual SSE intrinsics to equivalent sequences of native SIMD instructions.


This has to be an API thing. With unified memory, there's no need for a copy at all. There are probably some wrinkles, of course - the GPU has its own page table for example. So I can easily understand some RT latency based on API peculiarities, or not using the API in exactly the right way, or needing to adjust or synchronize page tables, etc.

Fundamentally, however, the underlying hardware should support dramatically less latency. PCIe has a lot of latency baked in - about 1 microsecond round trip for a single 32-bit read, in my experience. (It's really a networking protocol so this is not surprising.) For Apple Silicon, the round trip is through the on-chip SLC (system-level cache), which should be much faster.
Thanks.

How do you see Apple going forward to catch up and/or innovate in such scenarios?

What would a hypothetical Mac Pro silicon look like that matches or exceeds x86 + GPGPU accelerators communicating over PCIe (CXL is supposed to mitigate the current GPU-to-CPU communication bottlenecks)?

I am just trying to understand where Apple is going with its M-series chips (e.g., will it always be the case that Apple lags behind in pure raw throughput because they need perf per watt on their better-selling Macs at the cost of desktop offerings, or does the SoC approach have its own disadvantages:
- a ceiling on how high you can clock them, or
- an inability to scale well beyond two SoCs, like the rumoured cancellation of the M-series Extreme chips)?
 
How could that be unfair? The ARM CPU should definitely be praised and the benchmark discredited in this scenario.
Do you want to discredit a benchmark because it scales linearly with frequency?

Frequency is an implementation choice, with pros and cons. It seems unfair to me to choose benchmarks that praise the pros and discredit those that penalize the cons.
 
If the main difference in performance were those reasons and not how much it is optimized for x86 over ARM, the benchmark would be legit because those are implementation choices.

The problem with CB23 is not that it’s “unfair”. The problem is that it’s not a representative measure of general CPU performance.

If your primary usage is high-throughput SIMD, x86 is the undisputed king. There is simply no argument around it. You don’t need Cinebench to figure that out either. But those workflows are actually not that common. CPU raytracing, chess engines, some forms of HPC computing. That’s about it. If you only judge the performance of Apple Silicon by its CB23 score you are massively underestimating what it can do in other domains.



How do you see Apple going forward to catch up and/or innovate in such scenarios?

I know you asked @mr_roboto but I hope it’s fine with you if I also offer my 50 cents.

First, a quick observation. Apple SIMD CPU units in the current hardware are designed for flexibility. They have four independent 128-bit units in contrast to fewer but wider units used by x86 CPUs. This also explains why M1 performs exceedingly well in many flavors of scientific software which is less reliant on vectorization.

Now, to your question. I think there is evidence that Apple recognizes different types of workloads and designs separate solutions for them. I believe that their primary SIMD units will remain optimized for flexibility of use and low latency at the expense of throughput. For high-throughput, HPC-like work they will continue to improve the AMX coprocessor, which already surpasses the capabilities of x86 machines today. Note that modern ARM architectures make this kind of split official by offering regular and “streaming” vector processing modes (with the streaming mode operating very similarly to Apple's AMX). And finally, for raytracing, Apple's solution is the GPU. They already offer a state-of-the-art raytracing GPU API with very generous data limits, and there is evidence that their next generation of GPUs will offer raytracing acceleration.

I believe this approach is very promising, but Apple needs to continue improving the software side. Their matrix hardware is currently hidden from the programmer, and they need to give us direct access. Also, they need to improve the GPU programming model by offering virtual memory sharing between the CPU and the GPU as well as better data synchronization. Ideally one would have a programming model where different processors can communicate with very low latency (true for the CPU and AMX today, but not for the CPU and GPU).
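As an aside on the "hidden from the programmer" point: the only supported route to that matrix hardware today is indirect, through Accelerate. A plain CBLAS call like the sketch below (my example) is reportedly dispatched to the AMX-style units by Apple's BLAS, but there are no public intrinsics for it.

```cpp
// Plain CBLAS through Accelerate; nothing AMX-specific appears in the source.
// Build with: clang++ -O2 gemm.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 512;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);

    // C = 1.0 * A * B + 0.0 * C, single-precision, row-major GEMM.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A.data(), n,
                B.data(), n,
                0.0f, C.data(), n);

    std::printf("C[0] = %.1f\n", C[0]);  // 512 * (1.0 * 2.0) = 1024.0
}
```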
 
The problem with CB23 is not that it’s “unfair”. The problem is that it’s not a representative measure of general CPU performance.
Wouldn't other benchmarks have the same problem? Which benchmark would be better at predicting multithreaded performance?
 