Wouldn't other benchmarks have the same problem? Which benchmark would be better at predicting multithreaded performance?

If you want to predict performance with a benchmark, it’s best to choose a workload that’s similar in nature, or many of them (that’s how gaming benchmarks work).

Modern general-purpose benchmarks like GB or SPEC run multiple workloads of different types to see how hardware deals with them.
 
People aren't saying it's horrible; they're saying it's not properly optimized for Apple Silicon and thus should not be used as a cross-platform comparator. If we compare it to more platform-neutral benchmarks like SPEC or Geekbench, it appears that Cinebench suffers about a 10% AS-specific penalty.
Where does the claim of Geekbench being platform-neutral come from? I remember that AnandTech specifically refused to use Geekbench for comparing the performance of iPhones and Android phones because they deemed it not suitable for cross-platform comparisons. Also, according to data posted in this discussion, Geekbench produced scores with a 13% difference between Linux and Windows while running on the same PC. Many people believe that Geekbench favors Apple.
 
Where does the claim of Geekbench being platform-neutral come from? I remember that AnandTech specifically refused to use Geekbench for comparing the performance of iPhones and Android phones because they deemed it not suitable for cross-platform comparisons.

That was a criticism of earlier versions of GB. The latest version, 5, is much better in this regard. It just needs some improvements in the GPU benchmarks and longer running times.


Also, according to data posted in this discussion, Geekbench produced scores with a 13% difference between Linux and Windows while running on the same PC.

From a cursory glance, this seems to discuss version 4. I don't think there is such a big discrepancy in 5. Of course, OS allocator performance plays a big role in many benchmarks. The Windows memory allocator has historically been atrocious, but it has improved a lot in recent times.

Many people believe that Geekbench favors Apple.

Yeah, sure, because many people believe that Apple products have to be slower. That’s why all benchmarks showing the iPhone performing similarly to Intel laptops from 2016 on were dismissed as “it’s impossible”.
 
Also, according to data posted in this discussion, Geekbench produced scores with a 13% difference between Linux and Windows while running on the same PC.
The Geekbench developer thought that the difference might be due to the use of different compilers.
Linux and Windows do not use the same compiler. Some details on the compilers used for each platform are available via our page on the individual Geekbench 4 CPU workloads.

If that was the problem, it should be fixed now, since all platforms use Clang in GB5.


Many people believe that Geekbench favors Apple.
I have tried to compare a MacBook and a PC laptop with the same Intel chip, but I don't understand why the MacBook loses in single core and wins in multi core.
 
Can Cinebench predict the performance of non-embarrassingly parallel workloads?
Which benchmark can better predict the performance of non-embarrassingly parallel workloads?
In order:
No.
That's a good question, and I do not have an answer at hand.

You looked at GB5. I'm not entirely sure about this, I would have to go through the document GB publishes with details on each test, but I think I remember reading that it follows the SPEC model where tests are single-threaded, and multi-threaded results are generated by running several independent copies at the same time. Which would mean it tests only the embarrassingly parallel case.
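
To illustrate the distinction, here is a rough, hypothetical sketch (my own toy code, not Geekbench's or SPEC's actual harness) of the "run several independent copies" model: every thread works on its own private buffer, so there is no synchronization or shared data between threads.

```cpp
// Toy sketch of the "N independent copies" multi-core model (hypothetical,
// not the actual Geekbench/SPEC harness): each thread gets private data,
// so the threads never communicate or contend for shared structures.
#include <numeric>
#include <thread>
#include <vector>

static double independent_copy(std::vector<double>& data) {
    // Purely local work: each thread only ever touches its own buffer.
    for (double& x : data) x = x * 1.000001 + 0.5;
    return std::accumulate(data.begin(), data.end(), 0.0);
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    std::vector<std::vector<double>> buffers(n, std::vector<double>(1 << 20, 1.0));
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&buffers, i] { independent_copy(buffers[i]); });
    for (std::thread& t : workers) t.join();
}
```

A workload that is not embarrassingly parallel would instead have the threads coordinating over shared data structures, which is exactly the part this model never exercises.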

Benchmarking is really hard, cross platform benchmarking doubly so! There are vanishingly few public resources who do it well. You need a lot of deep technical knowledge and the will to spend a lot of time exploring why results are the way they are rather than accepting them at immediate face value. If you don't do this, you can easily fool yourself. AnandTech was the last enthusiast site I trusted to at least attempt to do this the right way, and unfortunately the person responsible for their highest quality work in recent times (Andrei F.) left to change careers.

Out of curiosity, how do you know? Was it written in Embree's release notes?
It's written in Embree's source code - it is an open source project, so you can go look.

Speaking of which, to correct a possible wrong impression from my earlier post - I don't think Intel wrote Embree to cheat on benchmarks. It's quite believable that they created and open sourced it simply because it's difficult to write and optimize SIMD code, they employ a bunch of experts who know how to do that, and they decided it would be useful to have those experts create a reference x86 SIMD raytracer for others to crib from. And if it got picked up and used by real-world application software, bonus! It all works out to benefit Intel.

The weird SSE emulation Arm code is someone else's contribution, IIRC.

Is the poor optimization of the ARM SIMD instructions the only reason why the actual result of the M1 Ultra in Cinebench R23 is so far below the theoretical result? Could the ARM SIMD instructions be worse than the x86 SIMD instructions?
The "emulate SSE" approach is a good way to get things running with mediocre performance on Arm, and a terrible way to optimize for maximum performance on Arm. The current project maintainers might not want to accept a truly separate SIMD codepath for Arm, though, as that increases maintenance burden a lot. With the current approach, whenever they make a bugfix or feature enhancement on the SSE code, it automatically translates to Arm. (I haven't tried to look on the project's public communications spaces to find out whether there's been any debates about that.)

As for "worse", that's very tricky. Generally speaking I've heard good things about the design of Arm's Neon SIMD, and lots of grumbles about SSE (less so about AVX), but I haven't personally coded for any of these so I can't speak from experience.
 
That was a criticism of earlier versions of GB. The latest version, 5, is much better in this regard. It just needs some improvements in the GPU benchmarks and longer running times.




From a cursory glance, this seems to discuss version 4. I don't think there is such a big discrepancy in 5. Of course, OS allocator performance plays a big role in many benchmarks. The Windows memory allocator has historically been atrocious, but it has improved a lot in recent times.



Yeah, sure, because many people believe that Apple products have to be slower. That’s why all benchmarks showing the iPhone performing similarly to Intel laptops from 2016 on were dismissed as “it’s impossible”.
Well, your arguments are as good as the claim that Cinebench is not optimized for Apple. It's just that - a claim without any proof.
 
So, if the library is open source it means it's not optimized for Apple? If it is open source it means it can be optimized by anyone for anything.
You can read the code? I guess you can't, apparently.

People are working on optimizing it but as @leman said, a whole scale replacement of the SIMD code path for Apple silicon could be rejected by the maintainers.
 
You can read the code? I guess you can't, apparently.

People are working on optimizing it but as @leman said, a whole scale replacement of the SIMD code path for Apple silicon could be rejected by the maintainers.
That does not prove anything. People are also working on optimizing it for x86. Any changes could be rejected. Are you saying the maintainers are preventing optimizations for Apple? Why would they do it?
 
That does not prove anything. People are also working on optimizing it for x86. Any changes could be rejected. Are you saying the maintainers are preventing optimizations for Apple? Why would they do it?
You clearly are not a software developer so this might be hard to understand but different kinds of changes can make life much more difficult for the developers who maintain a repository. Adding a whole new code path to optimize for a significantly different architecture is a large change. Many open source projects are part time and very large pull requests will be rejected because the developers literally don’t have enough time to review them.

Sometimes this can lead to a forked library but in that case there is no obligation by the developers of Cinebench to adopt the forked library.

Anyway, the current code can be read by developers like @leman, and they can see sub-optimal Neon SIMD code. You are either going to have to take it on faith or get someone you trust with the appropriate skill to do their own review.
 
You clearly are not a software developer so this might be hard to understand but different kinds of changes can make life much more difficult for the developers who maintain a repository. Adding a whole new code path to optimize for a significantly different architecture is a large change. Many open source projects are part time and very large pull requests will be rejected because the developers literally don’t have enough time to review them.

Sometimes this can lead to a forked library but in that case there is no obligation by the developers of Cinebench to adopt the forked library.

Anyway, the current code can be read by developers like @leman, and they can see sub-optimal Neon SIMD code. You are either going to have to take it on faith or get someone you trust with the appropriate skill to do their own review.
I am a software developer (though not on open source projects). Companies (including Apple) contribute to open source projects all the time. The claim that leman can see sub-optimal code has very little merit. He did not even claim that himself.
 
I am a software developer (though not on open source projects). Companies (including Apple) contribute to open source projects all the time. The claim that leman can see sub-optimal code has very little merit. He did not even claim that himself.
Then go read the code yourself. This is a very odd conversation if you are a programmer. The code is easily available.

Edit: There is a whole arm directory with emulation, sse2neon, and avx2neon headers.
 
Then go read the code yourself. This is a very odd conversation if you are a programmer. The code is easily available.

Edit: There is a whole arm directory with emulation, sse2neon, and avx2neon headers.
You are being naive if you think that one can find badly optimized code by just reviewing it. It's possible but only in rare situations where:
  • One is an expert in the field
  • One is very familiar with the specific task
  • The code is really bad
  • One spends significant time on code analysis
Normally one would have to use performance profiling and other techniques to find issues with the code.
 
You are being naive if you think that one can find badly optimized code by just reviewing it. It's possible but only in rare situations where:
  • One is an expert in the field
  • One is very familiar with the specific task
  • The code is really bad
  • One spends significant time on code analysis
Normally one would have to use performance profiling and other techniques to find issues with the code.
It’s kind of what I do for a living but thanks for the advice.
 
You are being naive if you think that one can find badly optimized code by just reviewing it.
It's weird how insistent you are that there's any controversy about this. Take a couple minutes to go look at the source. You are going to find plenty of evidence to support the idea that this is a library written and optimized for AVX and SSE, with only a quick-and-dirty NEON port. No, you do not need a deep review to come to this conclusion, not if you've ever dabbled in this domain of software (which is to say, assembly mixed into C code by way of intrinsics).
 
Blender has gotten some optimization in the last year.

Perhaps it's a good indicator, but I guess it also depends on what you're looking for.

I like to use Blender Open Data, as you can compare CPU and GPU performance, unlike with Cinebench.

It also gets the end result from 3 different scenes, which I think is a better approach than Cinebench too.

 
You are being naive if you think that one can find badly optimized code by just reviewing it. It's possible but only in rare situations where:
  • One is an expert in the field
  • One is very familiar with the specific task
  • The code is really bad
  • One spends significant time on code analysis
Normally one would have to use performance profiling and other techniques to find issues with the code.

Andrei at Anandtech agrees that Geekbench is better than Cinebench for CPU benchmarking. Read the discussion here.
 
Oof. If those are real, then 13th Gen Intel is really laying the SmackDown on Apple’s candy ass.
 
Andrei at Anandtech agrees that Geekbench is better than Cinebench for CPU benchmarking. Read the discussion here.
That's an interesting discussion. However, "better for CPU benchmarking" does not necessarily have anything to do with cross-platform. It may be good for benchmarking CPUs but still not useful for cross-platform comparisons.
 
Well, your arguments are as good as the claim that Cinebench is not optimized for Apple. It's just that - a claim without any proof.

You are welcome to check the GB scores for the same hardware on different platforms. You will find only minor differences that can be explained by differences in power management and kernel code.

People are working on optimizing it but as @leman said, a whole scale replacement of the SIMD code path for Apple silicon could be rejected by the maintainers.

Well, let’s not get too deep into controversy territory here. Apple just recently submitted a big patch to Embree that improved performance by 8%, and it got accepted just fine.


Anyway, the current code can be read by developers like @leman, and they can see sub-optimal Neon SIMD code. You are either going to have to take it on faith or get someone you trust with the appropriate skill to do their own review.
I am a software developer (though not on open source projects). Companies (including Apple) contribute to open source projects all the time. The claim that leman can see sub-optimal code has very little merit. He did not even claim that himself.

The code in question has been written and hand-optimized for Intel's SSE and AVX2. The ARM version emulates the semantics of x86 SIMD by implementing x86 platform intrinsics via ARM SIMD instructions. For some simpler cases, there is a 1-to-1 mapping. For other cases the ARM code is suboptimal (because ARM does not natively implement all of Intel's SIMD patterns).

This development approach in itself precludes proper optimizations for ARM processors because it assumes the x86 mental model. The primary target is x86, ARM is an afterthought, and the optimization opportunities are limited by the original code which is designed with an x86 processor in mind.

Let me be perfectly clear about one thing, though. I do not believe that the current Apple Silicon can outperform x86 designs in this particular domain. As I wrote before, M1's SIMD is designed for flexibility of use, not for maximal throughput. The M1 simply lacks the cache bandwidth and the clock speed to be a throughput monster. Apple could rewrite Embree from scratch using hand-optimized ARM code and they would still likely only gain another 10, maybe 15% at best. x86 would still be ahead in this type of workload simply due to their hardware advantage. And that's perfectly fine. Intel pays a huge price in power usage to be the performance leader in this niche (it needs approximately two to three times more power to produce the same result as Apple). And Apple's solution to this class of problems is the GPU, not the CPU.
 
You are welcome to check the GB scores for the same hardware on different platforms. You will find only minor differences that can be explained by differences in power management and kernel code.



Well, let’s not get too deep into controversy territory here. Apple just recently submitted a big patch to Embree that improved performance by 8%, and it got accepted just fine.





The code in question has been written and hand-optimized for Intel's SSE and AVX2. The ARM version emulates the semantics of x86 SIMD by implementing x86 platform intrinsics via ARM SIMD instructions. For some simpler cases, there is a 1-to-1 mapping. For other cases the ARM code is suboptimal (because ARM does not natively implement all of Intel's SIMD patterns).

This development approach in itself precludes proper optimizations for ARM processors because it assumes the x86 mental model. The primary target is x86, ARM is an afterthought, and the optimization opportunities are limited by the original code which is designed with an x86 processor in mind.

Let me be perfectly clear about one thing, though. I do not believe that the current Apple Silicon can outperform x86 designs in this particular domain. As I wrote before, M1's SIMD is designed for flexibility of use, not for maximal throughput. The M1 simply lacks the cache bandwidth and the clock speed to be a throughput monster. Apple could rewrite Embree from scratch using hand-optimized ARM code and they would still likely only gain another 10, maybe 15% at best. x86 would still be ahead in this type of workload simply due to their hardware advantage. And that's perfectly fine. Intel pays a huge price in power usage to be the performance leader in this niche (it needs approximately two to three times more power to produce the same result as Apple). And Apple's solution to this class of problems is the GPU, not the CPU.
I saw your comment in another thread about Embree's design being SSE/AVX2-centric, which prevents proper optimization with ARM SIMD, and thought that this claim might need some clarification. I am not familiar with these instruction sets specifically, and while I understand the general premise of your statement, I just wanted to mention that if the two instruction sets were designed to address the same problem, they may have a lot in common. In that case, porting the software to ARM, even without a complete re-architecture, may produce results that are optimal enough. But now that I see your estimate of a potential 10-15% improvement from a deep redesign, I think our positions are actually pretty close (and, again, I come from a much less informed position on this specific issue). Thanks for your informative posts.
 
As for "worse", that's very tricky. Generally speaking I've heard good things about the design of Arm's Neon SIMD, and lots of grumbles about SSE (less so about AVX), but I haven't personally coded for any of these so I can't speak from experience.

Slide 17 of the presentation "RISC-V Vectors and oneAPI: Accelerating the Future of Heterogeneous Compute"
 
I saw your comment in another thread about Embree's design being SSE/AVX2-centric, which prevents proper optimization with ARM SIMD, and thought that this claim might need some clarification. I am not familiar with these instruction sets specifically, and while I understand the general premise of your statement, I just wanted to mention that if the two instruction sets were designed to address the same problem, they may have a lot in common.

I can tell you a personal anecdote about this. A while ago I was writing a set of SIMD-optimized routines for solving geometric problems. Intel SIMD has a very nifty little instruction that allows you to create a bitmask from the set elements in a SIMD vector. This is widely used in x86 for all kinds of stuff. I heavily relied on this instruction to test different geometric conditions (e.g. I would use SIMD operations to compare multiple coordinate components at once and then inspect the mask to see which configuration I am dealing with). This was a nice set of logical puzzles and it worked very well in practice.

When I got my M1 machine and wanted to port my code, I suddenly had a problem. ARM SIMD does not offer this type of instruction at all. Emulating this behavior on ARM requires a sequence of instructions and constant loads, which is ugly and inefficient. At first I got upset and thought that ARM SIMD is simply crap. Then I spent some time reading the documentation. It turns out ARM offers a comprehensive set of fast horizontal reduction operations. I rewrote my algorithms using these operations and the resulting code was simpler, more straightforward, and also required fewer instructions than the x86 version. In another instance, using ARM-only versions of some operations, I could entirely eliminate checks for pathological cases (division by zero), which removed a lot of overhead from the original algorithm (it had to be special-cased on x86, but on ARM it can be handled implicitly).
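
To make that concrete, here is a simplified, hypothetical sketch (illustrative only, not my actual project code) of the same kind of "is any lane above the threshold?" test written both ways: the x86 movemask idiom versus a NEON horizontal reduction.

```cpp
// Simplified illustration of the two idioms (hypothetical code).
#if defined(__SSE__)
#include <xmmintrin.h>
bool any_greater(__m128 v, __m128 threshold) {
    // x86 idiom: compare lanes, collapse the sign bits into a 4-bit mask,
    // then branch (or switch) on the mask.
    __m128 cmp = _mm_cmpgt_ps(v, threshold);
    return _mm_movemask_ps(cmp) != 0;
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
bool any_greater(float32x4_t v, float32x4_t threshold) {
    // NEON idiom: there is no movemask, but a horizontal reduction over the
    // comparison result answers the question directly.
    uint32x4_t cmp = vcgtq_f32(v, threshold);  // all-ones lanes where v > threshold
    return vmaxvq_u32(cmp) != 0;               // horizontal max: nonzero if any lane set
}
#endif
```

On x86 the mask can also drive a switch over all the lane combinations, which is the "which configuration am I dealing with" trick above; on ARM you reach for reductions like vmaxvq/vminvq instead. That is why a mechanical port rarely ends up as natural code on either side.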

In the end, different systems give you different tools, and some problems need to be approached very differently to make the best use of those tools. Simple computations follow the same logic, but once you have a more complex problem, the optimal solution can look very different. Not to mention that architectural concerns can play a big role. In the case of Embree, the code assumes that there are a few wide SIMD units. It doesn't attempt to offer a high degree of ILP because that would be useless on an x86 CPU. But Apple has more execution units, so it needs more ILP for performance. And that's what you see in the hardware counters when running CB23 on Apple Silicon: low power usage, low ILP, low hardware utilization.
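
As a generic illustration of that last point (a toy example of my own, nothing to do with Embree's actual code), compare a reduction written with a single accumulator against one with several independent accumulators: the first forms one long dependency chain, the second exposes instruction-level parallelism that a wide core can actually use.

```cpp
// Toy example of exposing ILP (hypothetical, not Embree code).
#include <cstddef>

// One accumulator: every add must wait for the previous one to finish.
float sum_serial(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Four independent accumulators: up to four adds can be in flight at once,
// which a core with many FP/SIMD units can actually exploit.
float sum_ilp(const float* v, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

The same idea applies at the SIMD level: if the hot loops only keep a couple of dependent vector operations in flight, the extra execution units on a wide core simply sit idle.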
 