Can Cinebench predict the performance of non-embarrassingly parallel workloads?
Which benchmark can better predict the performance of non-embarrassingly parallel workloads?
In order:
No.
That's a good question, and I do not have an answer at hand.
You looked at GB5. I'm not entirely sure about this; I would have to go through the document GB publishes with details on each test, but I think I remember reading that it follows the SPEC model, where tests are single-threaded and multi-threaded results are generated by running several independent copies at the same time. That would mean it tests only the embarrassingly parallel case.
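To make concrete what that model looks like, here is a minimal sketch of the "several independent copies" style of harness. This is my own illustration of the idea, not Geekbench's or SPEC's actual code: each thread runs its own copy of a single-threaded kernel, with no shared data and no coordination beyond the final join.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for one benchmark test: purely local computation, no shared state.
uint64_t single_threaded_kernel() {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < 100'000'000; ++i) acc += i * i;
    return acc;
}

int main() {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    std::vector<uint64_t> results(n);
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back([&results, t] { results[t] = single_threaded_kernel(); });
    for (auto& w : workers) w.join();
    // Because the copies never communicate or contend for shared data, the
    // aggregate throughput is an embarrassingly parallel measurement and says
    // little about workloads where threads must cooperate.
    return 0;
}
```

A scheduler-heavy workload with locks, shared queues, or inter-thread dependencies can scale very differently from this, which is exactly why a "rate"-style multi-threaded score may not predict it.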
Benchmarking is really hard, and cross-platform benchmarking doubly so! There are vanishingly few public resources that do it well. You need a lot of deep technical knowledge and the will to spend a lot of time exploring why results are the way they are rather than accepting them at face value. If you don't do this, you can easily fool yourself. AnandTech was the last enthusiast site I trusted to at least attempt to do this the right way, and unfortunately the person responsible for their highest-quality work in recent times (Andrei F.) left to change careers.
Out of curiosity, how do you know? Was it written in Embree's release notes?
It's written in Embree's source code - it is an open source project, so you can go look.
Speaking of which, to correct a possible wrong impression from my earlier post - I don't think Intel wrote Embree to cheat on benchmarks. It's quite believable that they created and open sourced it simply because it's difficult to write and optimize SIMD code, they employ a bunch of experts who know how to do that, and they decided it would be useful to have those experts create a reference x86 SIMD raytracer for others to crib from. And if it got picked up and used by real-world application software, bonus! It all works out to benefit Intel.
The weird SSE emulation Arm code is someone else's contribution, IIRC.
Is the poor optimization of the ARM SIMD instructions the only reason why the actual result of the M1 Ultra in Cinebench R23 falls so far short of the theoretical result? Could the ARM SIMD instructions be worse than the x86 SIMD instructions?
The "emulate SSE" approach is a good way to get things running with mediocre performance on Arm, and a terrible way to optimize for maximum performance on Arm. The current project maintainers might not want to accept a truly separate SIMD codepath for Arm, though, as that increases maintenance burden a lot. With the current approach, whenever they make a bugfix or feature enhancement on the SSE code, it automatically translates to Arm. (I haven't tried to look on the project's public communications spaces to find out whether there's been any debates about that.)
As for "worse", that's very tricky. Generally speaking I've heard good things about the design of Arm's Neon SIMD, and lots of grumbles about SSE (less so about AVX), but I haven't personally coded for any of these so I can't speak from experience.