Really weird results under Geekbench 6. If the M1 Max single-core score is 2400 under GB6 (higher than the 1770 under GB5), how the hell is multi-core about the same as or lower than under GB5? Also, CPU utilisation during multi-core doesn't show all cores running at full capacity. Power consumption for single-core tests is under 4W most of the time, and for multi-core under 13W; the only multi-core test that hits the CPU harder is Clang, where it spikes to 33W. Am I missing something?
 
how the hell is multi-core about the same as or lower than under GB5? Also, CPU utilisation during multi-core doesn't show all cores running at full capacity.

“The multi-core benchmark tests in Geekbench 6 have also undergone a significant overhaul. Rather than assigning separate tasks to each core, the tests now measure how cores cooperate to complete a shared task. This approach improves the relevance of the multi-core tests and is better suited to measuring heterogeneous core performance. This approach follows the growing trend of incorporating “performance” and “efficient” cores in desktops and laptops (not just smartphones and tablets).”
 
Really weird results under Geekbench 6. If the M1 Max single-core score is 2400 under GB6 (higher than the 1770 under GB5), how the hell is multi-core about the same as or lower than under GB5? Also, CPU utilisation during multi-core doesn't show all cores running at full capacity. Power consumption for single-core tests is under 4W most of the time, and for multi-core under 13W; the only multi-core test that hits the CPU harder is Clang, where it spikes to 33W. Am I missing something?

GB6 changed how they do multi-core tests. Instead of running N copies of the same single-core task in parallel, they run one copy of a task that uses multiple cores to solve the problem. This is more representative of how multiple cores are used on the desktop. Of course, since this introduces dependencies and synchronisation between the parts running in parallel, it will scale worse than the naive "run the same thing N times" approach.
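
If it helps, here's a toy sketch of the structural difference (my own illustration, not Geekbench's actual code):

Code:
# Toy sketch of the two multi-core strategies (not Geekbench code).
import time
from concurrent.futures import ProcessPoolExecutor

N_WORKERS = 8
WORK = 2_000_000  # loop iterations representing one complete task

def busy(n):
    # stand-in for a CPU-bound workload
    s = 0
    for i in range(n):
        s += i * i
    return s

def old_style():
    # GB5-like: every worker gets its own complete, independent copy of the task
    with ProcessPoolExecutor(N_WORKERS) as pool:
        return list(pool.map(busy, [WORK] * N_WORKERS))

def new_style():
    # GB6-like: one task is split into chunks; the partial results have to be
    # merged, and the slowest worker gates the final answer. Real GB6 workloads
    # share data structures and synchronise far more than this trivial sum.
    with ProcessPoolExecutor(N_WORKERS) as pool:
        partials = pool.map(busy, [WORK // N_WORKERS] * N_WORKERS)
        return sum(partials)

if __name__ == "__main__":
    for label, fn in (("N independent copies", old_style), ("one shared task", new_style)):
        t0 = time.perf_counter()
        fn()
        print(f"{label}: {time.perf_counter() - t0:.3f} s")

The merge step and any waiting on the slowest worker are exactly the overheads the old "N independent copies" model never had to pay.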

Regarding the power consumption you observe, there can be multiple reasons. With the more complex real-world scaling, some CPU cores will spend time waiting, which reduces their power consumption. Also, the power figures you see are aggregates over a certain amount of time and might be averaging things in a weird way. It is also possible that the code used by GB6 is suboptimal (though I assume it was checked before the public release). One would have to look at it in more detail to understand what's really happening. Regardless, I think the results are fairly reasonable overall.
 
No, Apple Silicon destroys old Intel; this new version just works better on the Apple GPU. We all know Geekbench 5 was not able to use its full potential.
As I said, that doesn't track entirely with my experience. Only partially.

[EDIT: I mean that "Apple Silicon destroys old Intel" doesn't entirely track with my experience - but the M1 GPU does seem to perform similarly in my usage to a single FirePro D700, so I'll grant that, at least]
 
IMO the value of this benchmark for creative users has become even more dubious than it already was. I have a Mac Pro 2013 12 core that, in most tasks (especially 4K video editing & colour grading), is comparable with my Mac mini M1, and in some tasks outperforms it by a decent margin. The exception is for something like hardware accelerated H.265 encoding where the Mini does exceptionally well. But this benchmark skews the results drastically in favour of the M1. In fact, even my iPhone 12 mini gets multi-core results that are almost as good as the 12 core Mac Pro, which strikes me as absurd. The Metal scores have also skewed in favour of the new hardware. I think Geekbench 4 reflected reality for purposes like mine much better. Maybe not for machine learning or H.265 encoding, but for run of the mill creative work.

My scores for reference (single-core/multi-core/Metal):

iPhone 12 mini: 1999/4203/16149
MBP 15" 2017 3.1GHz w Radeon Pro 560: 1200/3996/20002
Mac Pro 12 core 2.7 w dual FirePro D700: 638/4682/29152 (single GPU)
Mac mini M1: 2320/8296/32313

GB estimates general system performance using a battery of typical workloads. It will never be as representative for a specific use case as a more narrow-purpose benchmark. For creative work, I'd look at pugetbench.
 
GB estimates general system performance using a battery of typical workloads. It will never be as representative for a specific use case as a more narrow-purpose benchmark. For creative work, I'd look at pugetbench.
Yeah, that one is probably better for my purposes as well as for many other creative pros.

By the way, my claim is not that Apple Silicon isn't impressive. (That should be obvious.) An entry level machine the size of a lunch plate that's able to keep pace with a machine that was Apple's top performance offering until a few years ago—at a fraction of the power usage and heat production—is no small feat. It just doesn't double the performance of a workstation class computer that uses far more power, in real world tasks that were and are still extremely common.

That shouldn't scandalize anyone, even the most ardent Apple fanboys.

It's just a bit disorienting for an ostensibly neutral, objective benchmark to move the goalposts all of a sudden, in favour of new hardware. Of course, no synthetic test can be neutral or objective, since they all involve judgment about what counts as an important measure of performance.
 
It's just a bit disorienting for an ostensibly neutral, objective benchmark to move the goalposts all of a sudden, in favour of new hardware.

I think it makes sense. Heterogeneous cores are here to stay, and making comparisons between them more accurate is an important goal.

Of course, no synthetic test can be neutral or objective, since they all involve judgment about what counts as an important measure of performance.

Ironically, the goal is to make it less artificial this way.
 
In other news, I can't get the new version to show me the CPU score on my NUC 12 i9 12900 (running Windows 11). It either hangs at the end, or completes without opening a browser window. The GPU compute test works as expected, though.
 
That shouldn't scandalize anyone, even the most ardent Apple fanboys.

Of course not, it's very reasonable. If it scandalises someone, well, that's their problem.

It's just a bit disorienting for an ostensibly neutral, objective benchmark to move the goalposts all of a sudden, in favour of new hardware. Of course, no synthetic test can be neutral or objective, since they all involve judgment about what counts as an important measure of performance.

I don't think it does? If anything, the changes to how multi-core performance is estimated will reduce the scores of some popular new hardware relative to previous versions of the benchmark. Recent years have seen a trend of adding a large number of lower-performing cores as a way to boost multi-core scores; this is the case both on the desktop (Intel since Alder Lake) and on mobile (current Android phones). Such designs perform very well in naive multi-core benchmarks, where each core runs its own task and no coordination between cores is required, but quickly run into scaling issues with many real applications.
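
To put a rough number on why that matters, here is a little Amdahl-style toy model (the relative core speeds and the serial fraction are made-up illustrative numbers, not measurements):

Code:
# Toy Amdahl-style model: throughput of P big cores plus E small cores.
# The relative speeds and serial fraction are made-up illustrative numbers.

def naive_copies(p_cores, e_cores, e_speed):
    # every core runs its own independent copy, so throughputs simply add up
    return p_cores * 1.0 + e_cores * e_speed

def shared_task(p_cores, e_cores, e_speed, serial_fraction):
    # one cooperative task: the serial/coordination part does not scale
    parallel_throughput = p_cores * 1.0 + e_cores * e_speed
    # Amdahl's law: speedup = 1 / (s + (1 - s) / parallel_throughput)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_throughput)

# e.g. 8 big cores plus 16 small cores running at 40% of a big core's speed
print(naive_copies(8, 16, 0.4))                       # 14.4x a single big core
print(shared_task(8, 16, 0.4, serial_fraction=0.05))  # ~8.6x
print(shared_task(8, 0, 0.4, serial_fraction=0.05))   # ~5.9x with big cores only

The sixteen extra small cores nearly double the naive "N copies" number, but they only lift the coordinated speedup from about 5.9x to 8.6x, which is roughly the kind of gap the new multi-core tests expose.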

Regarding your Mac Pro, well, it is indeed much slower than modern systems. That you still get good performance out of it for your workload simply means that your workload is not that CPU-heavy. The D700 was a massive GPU back in the day, and two of them still pack a punch. Content creation generally uses fairly simple GPU programs, where the raw throughput and VRAM bandwidth of your machine can indeed be expected to outperform the much more constrained M1 system. At the same time, I have no trouble believing that the D700 will struggle with some of the more complex GPU programs, as newer GPUs are often able to execute non-trivial GPU code more efficiently. Plus, the Tahiti drivers are probably quite old by now and not well optimised for modern GPGPU code.

Overall, it's not that the benchmark mischaracterises the capability of your hardware; it's just that content creation (of the particular kind you care about, at least) might be less sensitive to some factors and more sensitive to others. Of course, I can easily imagine that this is confusing for a user who is simply trying to find objective information.
 
A couple of observations from perusing through v6 results:

- A lot of people tested their iPhone 14 Pro
- Older systems are quickly outclassed
- Newest AMD/Intel hitting above 3,000 single/20,000 multi and scoring much better than latest Apple chips - but those seem to be desktops
- In the grand scheme of things, Apple Silicon is in a tighter range of results than Intel/AMD

I'm eager to see how Apple's 3nm offerings fare.
 
Really weird results under Geekbench 6. If the M1 Max single-core score is 2400 under GB6 (higher than the 1770 under GB5), how the hell is multi-core about the same as or lower than under GB5? Also, CPU utilisation during multi-core doesn't show all cores running at full capacity. Power consumption for single-core tests is under 4W most of the time, and for multi-core under 13W; the only multi-core test that hits the CPU harder is Clang, where it spikes to 33W. Am I missing something?

Not the same baseline scores or the same tests. Stop trying to make sense of discrepancies across GB versions. The i7 used as the GB6 baseline is much more capable at multi-core than the i3 used for GB5, so multi-core scores may be relatively "lower". And GB6 multi-core tests have changed to reflect the modern use of multiple core types in a CPU: performance and efficiency cores (and some Snapdragons have a third, a single "power" core to boost their single-core scores).

And stop looking at power consumption. Some tests may not need the CPU to run at full throttle. The benchmark is not checking how fast the CPU can run; it's testing how long a specific task takes to complete. And remember, these SoCs are designed for efficiency, so the system will only run a core as fast as the task requires. The system may know that running a core at 3.2GHz is not going to get the task done any sooner than running it at 1GHz.
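
To make the distinction concrete, here is a toy sketch (my own, nothing to do with Geekbench's internals) of timing a fixed piece of work versus probing peak throughput:

Code:
# Fixed-work timing (what a Geekbench-style test does) vs. peak-throughput
# probing (what a "how fast can it go" stress test does). Illustrative only.
import time

def example_task():
    # stand-in for one specific workload
    return sum(i * i for i in range(200_000))

def fixed_work_score(task, baseline_seconds):
    # time one specific task; faster completion means a higher score,
    # regardless of what clock speed the core happened to pick
    t0 = time.perf_counter()
    task()
    elapsed = time.perf_counter() - t0
    return baseline_seconds / elapsed

def peak_throughput(op, duration=2.0):
    # hammer the core for a fixed wall-clock time and count completed operations;
    # this is the kind of test that forces maximum sustained clocks
    done = 0
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        op()
        done += 1
    return done / duration

print("score vs. a 0.05 s baseline:", round(fixed_work_score(example_task, 0.05), 2))
print("ops/sec flat-out:", round(peak_throughput(example_task), 1))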
 
Not the same baseline scores or the same tests. Stop trying to make sense of discrepancies across GB versions. The i7 used as the GB6 baseline is much more capable at multi-core than the i3 used for GB5, so multi-core scores may be relatively "lower". And GB6 multi-core tests have changed to reflect the modern use of multiple core types in a CPU.

And stop looking at power consumption. Some tests may not need the CPU to run at full throttle. The benchmark is not checking how fast the CPU can run; it's testing how long a specific task takes to complete. And remember, these SoCs are designed for efficiency, so the system will only run a core as fast as the task requires. The system may know that running a core at 3.2GHz is not going to get the task done any sooner than running it at 1GHz.
Well, right. So now we are measuring how well GB's code is written and how it runs on a particular platform. If the CPU is not fully utilised, it does not need to run faster, so power management will scale CPU frequencies down. The same applies to the GPU: if the load is not heavy and long enough, the GPU will not scale its frequencies up, because the task is already done. Such tests end up measuring power scaling, and I don't know what that is supposed to mean.

The only test that is good is Clang. It seems that this particular one can hit the CPU hard enough to actually measure something. The others, meh.

GPU tests, same story: too short. What is the point of measuring GPU speed when the test is so short that the GPU only had a chance to scale up to 400MHz at most? 15W max. During the game "Resident Evil - Village of Shadows" it runs at 1200MHz. What is the point?
 
- Newest AMD/Intel hitting above 3,000 single/20,000 multi and scoring much better than latest Apple chips - but those seem to be desktops

Which really puts things in perspective. The cost x86 desktops pay to be 15% faster than the base M2 is enormous. We are talking about an almost 10x difference in power consumption!


Well, right. So now we are measuring how well GB's code is written and how it runs on a particular platform. If the CPU is not fully utilised, it does not need to run faster, so power management will scale CPU frequencies down. The same applies to the GPU: if the load is not heavy and long enough, the GPU will not scale its frequencies up, because the task is already done. Such tests end up measuring power scaling, and I don't know what that is supposed to mean.

The only test that is good is Clang. It seems that this particular one can hit the CPU hard enough to actually measure something. The others, meh.

GPU tests, same story: too short. What is the point of measuring GPU speed when the test is so short that the GPU only had a chance to scale up to 400MHz at most? 15W max. During the game "Resident Evil - Village of Shadows" it runs at 1200MHz. What is the point?

What you see with powermetrics are aggregates over certain intervals. What matters is the power usage and clock residency during the actual testing, and that is much more difficult to see. You'd need to sample at a much higher rate and analyse the data quantitatively.
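
If someone wants to try this, something along these lines (run with sudo; the flags and the output format are from memory and can differ between macOS versions, so treat it as a sketch) samples CPU power every 100 ms instead of relying on the coarse default aggregates:

Code:
# Sketch: sample powermetrics at a higher rate and summarise CPU power.
# Run the script with sudo. Adjust the parsing to whatever your machine prints.
import re
import subprocess

CMD = ["powermetrics", "--samplers", "cpu_power", "-i", "100", "-n", "100"]

out = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
# On Apple Silicon the report typically contains lines like "CPU Power: 3120 mW"
samples_mw = [float(m) for m in re.findall(r"CPU Power:\s*([\d.]+)\s*mW", out)]

if samples_mw:
    print(f"{len(samples_mw)} samples")
    print(f"min/avg/max: {min(samples_mw):.0f} / "
          f"{sum(samples_mw)/len(samples_mw):.0f} / {max(samples_mw):.0f} mW")
else:
    print("No 'CPU Power' lines found; inspect the raw output format.")

Run Geekbench in another window while this collects samples and you can see how much of each interval is actually spent near peak power versus waiting.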

At any rate, with GB5 compute my M1 Max was around 50% slower than the desktop RTX 3060. With GB6 it's roughly the same. As the compute capability of these GPUs is fairly similar (13 vs. 10 TFLOPS) I'd say that GB6 produces a result that makes sense to me.
 
Do you guys remember the benchmarks Jobs used to quote in the G4/5 days? I forgot the name now…. Are those still used by anyone?
 
Well, right. So now we are measuring how well GB's code is written and how it runs on a particular platform. If the CPU is not fully utilised, it does not need to run faster, so power management will scale CPU frequencies down. The same applies to the GPU: if the load is not heavy and long enough, the GPU will not scale its frequencies up, because the task is already done. Such tests end up measuring power scaling, and I don't know what that is supposed to mean.

The only test that is good is Clang. It seems that this particular one can hit the CPU hard enough to actually measure something. The others, meh.

GPU tests, same story: too short. What is the point of measuring GPU speed when the test is so short that the GPU only had a chance to scale up to 400MHz at most? 15W max. During the game "Resident Evil - Village of Shadows" it runs at 1200MHz. What is the point?

It is NOT measuring processor speed… it is measuring the amount of time it takes to perform specific tasks.

If I want to benchmark how long it takes a processor to add one plus one, all I want to know is how long THAT takes on each processor I run the benchmark on. Then I compare those times. The maximum speed of a processor is inconsequential to my benchmark.

If you want to compare the absolute speed of a processing unit, find a benchmark that is meant for that. Geekbench is not that benchmark. It is a general-purpose benchmarking suite that tests tasks an average user might want to perform. The scores are relative to how well an i7-12700 performed those same tasks. It has nothing to do with testing the maximum performance of any processor.
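
As a rough worked example (as I understand it, GB6 calibrates a score of 2,500 to the i7-12700 reference machine; the timings below are made up), the score for a task is essentially the baseline's time divided by yours, scaled to that calibration number:

Code:
# Rough illustration of relative scoring (made-up timings; the real suite
# combines many workloads with a geometric mean, this only shows the idea).
BASELINE_SCORE = 2500  # GB6 calibrates the Core i7-12700 baseline to 2500

def workload_score(my_seconds, baseline_seconds):
    # finishing the same task in half the baseline's time doubles the score,
    # no matter what clock speed the cores happened to run at
    return BASELINE_SCORE * (baseline_seconds / my_seconds)

# hypothetical task: the baseline machine takes 4.0 s, the machine under test 2.5 s
print(workload_score(my_seconds=2.5, baseline_seconds=4.0))  # 4000.0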
 
The only test that is good is Clang. It seems that this particular one can hit the CPU hard enough to actually measure something.

Most people do more on their computer than run a compiler all day. Even most developers don’t generally have it continuously compiling stuff.
 
Running tests on my iPhone 8 and I am getting Metal scores ranging from 3500 to 5200…

Not seeing that kind of swing running Geekbench 5. Well not more than maybe 30-50 points anyway.
 
Sure, but that does not equate to real-world usage then. I just offered a real-world usage test that pegged all my CPU cores at 100% for like 7 minutes.

This is why I pretty much take any Geekbench results very lightly. Running quick little snippets of things that get done in the real world does not give a super accurate look at actual performance. It does not even take into account any throttling that might happen because the CPU or GPU is getting hammered for a long period of time.

And please do not take this post as bashing of Apple Silicon; Apple has done an amazing job so far with it, and the power/performance is very, very good.

I would trust Cinebench more for CPU since I know it is going to hammer the CPU for a long time period.

This is more or less just an observation of what happens during the test. But I am still curious as to whether Apple Silicon behaves in the same way as Intel hardware, under macOS, using this benchmark suite.
Cinebench has particularly bad core utilisation on Apple silicon.
 
Yeah, that one is probably better for my purposes as well as for many other creative pros.

By the way, my claim is not that Apple Silicon isn't impressive. (That should be obvious.) An entry level machine the size of a lunch plate that's able to keep pace with a machine that was Apple's top performance offering until a few years ago—at a fraction of the power usage and heat production—is no small feat. It just doesn't double the performance of a workstation class computer that uses far more power, in real world tasks that were and are still extremely common.

That shouldn't scandalize anyone, even the most ardent Apple fanboys.

It's just a bit disorienting for an ostensibly neutral, objective benchmark to move the goalposts all of a sudden, in favour of new hardware. Of course, no synthetic test can be neutral or objective, since they all involve judgment about what counts as an important measure of performance.
Honestly I think it's great.
The existing common "parallel" benchmarks, whether Cinebench, GB5, or SPEC_RATE, don't provide much useful info (mainly they tell you a derating factor for multi-core on account of memory bandwidth and thermals, and the extent to which that factor is much larger for x86 than for Apple).

GB6 is engaging in some real-world measurement of the cost of things like OS overhead and locks, i.e. the cost of communicating and synchronizing between cores. This is an essential part of most desktop and mobile parallelism (unlike the server "run multiple unrelated VMs" case, which is not the space GB6 is really interested in anyway).
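
A quick way to see that cost for yourself (a toy measurement, nothing to do with GB6's internals): time the same amount of counting done with per-thread state versus funnelled through one shared lock.

Code:
# Toy measurement of coordination cost (not related to GB6 internals).
# Note: CPython's GIL means these threads don't run Python bytecode in
# parallel anyway; the point is only that the shared lock itself adds
# measurable overhead compared with keeping per-thread state.
import threading
import time

THREADS = 4
INCREMENTS = 200_000  # per thread

def independent(results, idx):
    # each thread keeps its own counter; results are merged at the end
    local = 0
    for _ in range(INCREMENTS):
        local += 1
    results[idx] = local

shared_total = 0
shared_lock = threading.Lock()

def coordinated():
    # every increment goes through one shared lock
    global shared_total
    for _ in range(INCREMENTS):
        with shared_lock:
            shared_total += 1

def run(target, needs_index):
    results = [0] * THREADS
    threads = []
    for i in range(THREADS):
        args = (results, i) if needs_index else ()
        threads.append(threading.Thread(target=target, args=args))
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

print("independent:", round(run(independent, True), 3), "s")
print("coordinated:", round(run(coordinated, False), 3), "s")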

The one thing I think would be interesting to see in GB (maybe GB7...) is some testing of VM overhead. Maybe that's the way to do it -- have a third "VM Core" tab which fires up some number of VMs and runs some number of non-interacting code instances (like GB5) in each VM...
 
A couple of observations from perusing through v6 results:

- A lot of people tested their iPhone 14 Pro
- Older systems are quickly outclassed
- Newest AMD/Intel hitting above 3,000 single/20,000 multi and scoring much better than latest Apple chips - but those seem to be desktops
- In the grand scheme of things, Apple Silicon is in a tighter range of results than Intel/AMD

I'm eager to see how Apple's 3nm offerings fare.
I wouldn't characterize things this way.
My quick summary ( see eg https://www.computerbase.de/2023-02/geekbench-6-die-neue-benchmark-suite-im-leser-benchmark/ ) is

- The fastest available x86 cores are running single-threaded at around 10 to 20% faster than M2.

- The fastest available x86 multi-core designs are likewise about 10 to 20% faster than the M2 Pro/Max in the same sort of price range. So e.g. the closest match to the M2 Pro/Max is an Intel 13th-gen i5 design with 6(*2)+8 cores. That's against Apple's 8+4 cores. Apple wins on more "big" cores, but Intel has SMT and more E-cores, which are substantially more powerful than Apple's E-cores. Yet overall we have a draw. To get much better than Apple you need to go to 16(*2) P-cores (AMD) or 8(*2)+16 cores (Intel).

I think this is a very good showing on the Apple side, especially since, as I keep trying to point out, the A15 (and thus the M2) was not designed to be a great performance improvement on the A14/M1; it was designed to be substantially more energy efficient. The performance improvements will come with the N3 designs, from both the expected process improvements and the expected IPC improvements. There's just no way Apple can't get more than a 20% IPC boost (i.e. matching the best current x86 single-threaded)! I mean, I can (and have) listed a number of practical improvements to this end.
(The reason the schedule was derailed, giving us the apparently boring and disappointing A16, is simply COVID, which led to a delayed N3. The good times are returning!)

I'm honestly more curious these days about how the GPU will play out. I've now investigated enough GPU details to conclude that Apple is on a path there very much like the CPU path. The starting point was somewhat different (more like a traditional GPU, probably because they started with Imagination IP), but over time the standard Apple package of throwing intelligence, not just raw frequency or MOAR GPU CORES!, at the problem has been applied. The most recent public GPU patents (around 2020, 2021) show some really cool ideas that, while they will not show up in basic dense linear algebra tests, should substantially speed up both real graphics and many real-world GPGPU compute use cases (graphs, sparse linear algebra, FFT, stuff like that).

I'd like to see more cross-comparison of Apple GPU results against other GPUs, but I have not yet found such a resource on the net. Is there a secret GB6 browser site that's not yet visible to Google search?
 
Running tests on my iPhone 8 and I am getting Metal scores ranging from 3500 to 5200…

Not seeing that kind of swing running Geekbench 5. Well not more than maybe 30-50 points anyway.
Geekbench 6 generates significant heat on older devices. Make sure your device has cooled down before running another test.
 
Geekbench 6 generates significant heat on older devices. Make sure your device has cooled down before running another test.

Ya, that was my first thought, but the highest score wasn't even from the first run.

Also tried turning it off. Waited 10 min before powering back on. Running it in Airplane Mode.

Still seeing wild swings with GB6, while the numbers are very consistent in GB5.
 