
Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
This thread is over my head, but as a chess player who uses Stockfish as the analysis engine in HIARCS Chess Explorer on my M1 MBA, is there something actually useful for me here, or is it all about benchmarks?

HIARCS Chess Explorer uses the HIARCS engine by default but I loaded the Stockfish engine using Homebrew.

I switched from HIARCS engine to Stockfish engine in the belief that Stockfish is a more modern, stronger chess engine, not necessarily faster (no idea about that) but more speed would obviously be good if available.

Thanks
Yes, you will see Stockfish running faster, which translates to being able to analyze deeper in the same amount of time. Engines can think indefinitely, so any speedup improves strength. Stockfish is currently the strongest chess engine.
 
  • Like
Reactions: Appletoni

Mike Boreham

macrumors 68040
Aug 10, 2006
3,767
1,784
UK
Yes, you will see Stockfish running faster, which translates to being able to analyze deeper in the same amount of time. Engines can think indefinitely, so any speedup improves strength. Stockfish is currently the strongest chess engine.
Thanks @Sopel . I had a look at the link in your OP and it wasn't obvious to me how to apply the patch. Will it become available through the usual brew upgrade at some stage?
 
  • Like
Reactions: Appletoni

throAU

macrumors G3
Feb 13, 2012
8,944
7,106
Perth, Western Australia
Optimization work has been ongoing recently, and the result is this PR, which reportedly provides up to an 80% performance increase (one person reports 32%, another 80-85%; the latter person's results are presented below).

I got some numbers from M1 that are comparable with other numbers at http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php. This test is designed to measure mostly the network performance (with hybrid evaluation disabled, only NNUE evaluation is used).

With original Stockfish 14.1 (number of nodes differs between runs because multithreaded search is not deterministic)
Code:
4 threads:
Total time (ms) : 299585
Nodes searched  : 1073196201
Nodes/second    : 3582276

8 threads:
Total time (ms) : 305158
Nodes searched  : 1657744757
Nodes/second    : 5432414

With optimizations cherry-picked on top of Stockfish 14.1
Code:
4 threads:
Total time (ms) : 140385
Nodes searched  : 931041959
Nodes/second    : 6632061

8 threads:
Total time (ms) : 174221
Nodes searched  : 1714777303
Nodes/second    : 9842540

This places the M1 as competitive with the i7-6700K, but still measurably behind the Intel Core i7-11800H and far behind the AMD Ryzen 9 4900H.
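The size of the speedup implied by the quoted nodes/second figures can be checked with a few lines (the numbers are copied verbatim from the benchmark output above):

```python
# Speedup implied by the nodes/second figures quoted above.
before = {4: 3582276, 8: 5432414}  # Stockfish 14.1
after = {4: 6632061, 8: 9842540}   # with the optimizations cherry-picked

for threads in (4, 8):
    gain = after[threads] / before[threads] - 1
    print(f"{threads} threads: +{gain:.0%}")  # roughly +85% and +81%
```

which matches the "80-85%" claim for this particular machine.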

BUT YOU CAN'T MAKE THAT UP WITH OPTIMISATION



literally 1 week later... no doubt after going after just the low-hanging fruit...


paging @Leifi
 

Taz Mangus

macrumors 604
Mar 10, 2011
7,815
3,504
Yes, you will see Stockfish running faster, which translates to being able to analyze deeper in the same amount of time. Engines can think indefinitely, so any speedup improves strength. Stockfish is currently the strongest chess engine.
You kept an open mind and attacked the code. It's great that you stuck with it and that the results bear fruit. It made no sense to make a claim based on false evidence; it's good that this has been corrected now.

Are you saying that the other chess engines run even worse on Apple Silicon M1?
 
Last edited:
  • Like
Reactions: Tagbert

Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
Are you saying that the other chess engines run even worse on Apple Silicon M1?
I'm saying that Stockfish is the best chess engine available regardless of hardware (modulo extreme GPU-to-CPU ratios, where Lc0 is better). By a large margin. But yes, I don't know of any other engine that is as optimized for ARM NEON as Stockfish is. For older engines that didn't use neural networks there was not much that could have been optimized, so they run just fine; they are just bad.
 
  • Like
Reactions: Appletoni

ingambe

macrumors 6502
Mar 22, 2020
320
355
I'm saying that Stockfish is the best chess engine available regardless of hardware (modulo extreme GPU-to-CPU ratios, where Lc0 is better). By a large margin. But yes, I don't know of any other engine that is as optimized for ARM NEON as Stockfish is. For older engines that didn't use neural networks there was not much that could have been optimized, so they run just fine; they are just bad.
What is the performance of Leela chess zero compared to stockfish?

PS: thanks for your patch
 
  • Like
Reactions: Appletoni

Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
What is the performance of Leela chess zero compared to stockfish?

PS: thanks for your patch
Lc0 is not meant to be run on a CPU, and the two engines use vastly different networks and search, so their performance is not comparable in any meaningful way. Strength-wise it really depends on hardware; https://www.sp-cc.de/nn-vs-sf-testing.htm has some results for "i7-8750H 2.6GHz (Hexacore) Notebook, RTX 2060 GPU, Windows 10 64bit, 16GB RAM" hardware. Note that Stockfish 13 is quite a bit behind the current version, possibly around 40-50 Elo with these settings. I'd say that Stockfish on a modern 4-core CPU might be equal to Lc0 on an RTX 3080. Stockfish can also utilize much more hardware than Lc0 can. There's also Ceres https://github.com/dje-dev/Ceres, a rewrite of Lc0 that tries to mitigate some of the scaling issues, but it's still not perfect. Stockfish can scale to tens of thousands of cores (cluster builds), whereas Lc0 has issues with more than 2 high-end GPUs.
 

ingambe

macrumors 6502
Mar 22, 2020
320
355
Lc0 is not meant to be run on a CPU, and the two engines use vastly different networks and search, so their performance is not comparable in any meaningful way. Strength-wise it really depends on hardware; https://www.sp-cc.de/nn-vs-sf-testing.htm has some results for "i7-8750H 2.6GHz (Hexacore) Notebook, RTX 2060 GPU, Windows 10 64bit, 16GB RAM" hardware. Note that Stockfish 13 is quite a bit behind the current version, possibly around 40-50 Elo with these settings. I'd say that Stockfish on a modern 4-core CPU might be equal to Lc0 on an RTX 3080. Stockfish can also utilize much more hardware than Lc0 can. There's also Ceres https://github.com/dje-dev/Ceres, a rewrite of Lc0 that tries to mitigate some of the scaling issues, but it's still not perfect. Stockfish can scale to tens of thousands of cores (cluster builds), whereas Lc0 has issues with more than 2 high-end GPUs.
Thanks for the insight! Very interesting :)
 
  • Like
Reactions: Appletoni

Krevnik

macrumors 601
Sep 8, 2003
4,100
1,309
Looks good! If I may ask, how did you build stockfish? Did you use profile-guided build?

I followed the instructions in the README.md file for the GitHub repo, so it’s using the default build settings. I have enough random side projects of my own to deal with that I don’t have a ton of time to ramp up on someone else’s project at the moment.

Exactly,
That’s why I was very surprised by the performance of the M1 in the original thread
The M1 has a huge cache and very high memory bandwidth; both should help it a lot in an alpha-beta search, where much of the performance can be killed by cache misses.
So either it's a software problem, or the M1 scheduler is particularly bad for this kind of workload.

I guess my musing here is about AVX2/etc vs NEON. While M1 is clearly a strong chip, a lot of benchmarks and real world apps aren’t heavy on SIMD. And I haven’t seen a lot of discussion between the two SIMD implementations to get an idea how close they are in reality. So I’m not sure how much impact things like AVX2 being 256-bit versus NEON being 128-bit play in. How much would having SVE improve things on the ARM side? Would leveraging AMX2 on M1 via the Accelerate framework help at all (it can be ~2x faster than NEON in certain cases)? Does M1 have a similar number of SIMD units per core as modern Intel? Things I just don’t honestly know enough about that would have an impact on the final results here.

Add to that differences in the maturity of the SIMD implementations for ARM and x64 in stockfish (and efforts trying to squeeze out more performance), and any differences there can be magnified quite a bit.
 
Last edited:
  • Like
Reactions: Appletoni

leman

macrumors Core
Oct 14, 2008
19,302
19,284
I followed the instructions in the README.md file for the GitHub repo, so it’s using the default build settings. I have enough random side projects of my own to deal with that I don’t have a ton of time to ramp up on someone else’s project at the moment.

Well, should you have some time to run the benchmark again please use

make -j profile-build ARCH=apple-silicon

I have seen dramatic performance improvements on profile guided builds on x86.
 
  • Like
Reactions: Appletoni

jeanlain

macrumors 68020
Mar 14, 2009
2,440
936
This places the M1 to be competitive with i7-6700k, but still measurably worse than Intel Core i7 11800H and far behind AMD Ryzen 9 4900H.
Not unexpected, given that these have twice the number of high-performance cores.
 
  • Like
Reactions: JMacHack

leman

macrumors Core
Oct 14, 2008
19,302
19,284
I guess my musing here is about AVX2/etc vs NEON. While M1 is clearly a strong chip, a lot of benchmarks and real world apps aren’t heavy on SIMD. And I haven’t seen a lot of discussion between the two SIMD implementations to get an idea how close they are in reality. So I’m not sure how much impact things like AVX2 being 256-bit versus NEON being 128-bit play in. How much would having SVE improve things on the ARM side? Would leveraging AMX2 on M1 via the Accelerate framework help at all (it can be ~2x faster than NEON in certain cases)? Does M1 have a similar number of SIMD units per core as modern Intel? Things I just don’t honestly know enough about that would have an impact on the final results here.

This is not a trivial question. Yes, NEON is 128-bit, but Apple Silicon currently offers 4x SIMD units (for 4x 128-bit ops per cycle), while most AVX2 implementations can do 2x 256-bit ops per cycle, so it's more or less the same. Some Intel CPUs with AVX-512 can do more than one 512-bit operation per cycle, but it is unclear to me under which conditions.
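The "more or less the same" claim can be made concrete with a quick peak-width calculation, using the unit counts quoted above (the post's figures, not measured values):

```python
# Peak SIMD width per cycle, using the unit counts quoted above
# (these are the post's figures, not measured values).
m1_bits = 4 * 128    # 4 NEON pipes x 128-bit
avx2_bits = 2 * 256  # 2 AVX2 pipes x 256-bit
print(m1_bits, avx2_bits)  # both come to 512 bits per cycle
```

So at equal clocks, peak vector throughput per core is identical on paper; the differences come from clocks, loads/stores, and instruction selection, as discussed below.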

To make comparisons more difficult, though, there is also a difference in clocks. M1 will run at ~2.9-3.0 GHz while running heavy SIMD code, while x86 CPUs can clock significantly higher. However, since we are talking about sustained multithreaded throughput, the x86 chip is probably operating close to its base frequency (which is ~3 GHz for modern mobile x86), so we should get comparable throughput.

And then there is the matter of data loads and stores. Modern x86 CPUs can sustain two 256-bit fetches from L1D, while M1 can't. If I understand the available Firestorm info correctly, it will only do two 128-bit loads from L1D per cycle on most data (plus an additional 128-bit load if the load overlaps with a previous store). This might limit M1 performance on workloads where the ratio of operations to memory transfers is low. Stockfish could be one of those cases (it processes large batches of data but does only a few operations per load), so it might be limited by L1D performance, but who knows. At the same time, M1 has much larger caches, so it might have an advantage when processing large amounts of data with less predictable access patterns (which does not seem to be the case for Stockfish specifically).
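To illustrate the "few operations per load" point, here is a toy arithmetic-intensity estimate for a hypothetical int8 affine layer (the 1024x16 dimensions are made up for illustration and are not Stockfish's actual network shape):

```python
# Toy arithmetic-intensity estimate for a hypothetical int8 affine layer.
# Dimensions are illustrative only, not Stockfish's real NNUE topology.
inputs, outputs = 1024, 16
weight_bytes = inputs * outputs  # int8 weights streamed from memory
macs = inputs * outputs          # one multiply-accumulate per weight byte
intensity = macs / weight_bytes  # ops per byte loaded
print(intensity)  # ~1 op per byte: such a loop is load-bound, not ALU-bound
```

At roughly one operation per byte loaded, load bandwidth rather than ALU width becomes the bottleneck, which is why the L1D fetch rate matters here.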

Finally, looking at the ISA itself, it's a bit of a mixed bag. There are some patterns in x86 SIMD that NEON can't do directly, where you need a bunch of instructions to do the same thing. And vice versa.

Overall, for throughput-oriented, stream-processing-style code, I'd expect the Firestorm core to perform similarly to, or slightly worse than, a modern AVX2 x86 core, assuming well-optimized code for both platforms. On more complex workloads that involve dependent data fetches, non-trivial conditions, and a high SIMD-to-data-transfer ratio, Firestorm might pull ahead. On less optimized code, M1 should be faster, since it's overall a more flexible architecture. We can already see this in benchmarks like SPEC that do not employ architecture-specific SIMD optimizations (or where using SIMD would not work in the first place): M1's four independent FP units allow it to execute the code much faster.

It was Apple's decision not to sell 16 to 20 high-performance cores, which was a bad decision.

Please just go away.
 

Krevnik

macrumors 601
Sep 8, 2003
4,100
1,309
Well, should you have some time to run the benchmark again please use

make -j profile-build ARCH=apple-silicon

I have seen dramatic performance improvements on profile guided builds on x86.

Oddly, it goes up by nearly 1M when just using the Perf cores, but down by ~500k when using the E cores as well:

So, it depends on what you are hoping to accomplish.

Code:
./stockfish bench 1024 8 26 default depth nnue
===========================
Total time (ms) : 77095
Nodes searched  : 1159403293
Nodes/second    : 15038631

./stockfish bench 1024 10 26 default depth nnue
===========================
Total time (ms) : 76914
Nodes searched  : 1192240595
Nodes/second    : 15500956
 
  • Like
Reactions: Appletoni

dugbug

macrumors 68000
Aug 23, 2008
1,869
1,953
Somewhere in Florida
Oddly, it goes up by nearly 1M when just using the Perf cores, but down by ~500k when using the E cores as well

I've seen that kind of problem on other systems where threads take work chunks and then join at the end to stay synchronized across steps. The problem is that in big.LITTLE you now have slower cores, so this strategy can hurt you: the join holds everyone back for the slowest core/thread.
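That effect can be sketched with a toy model (the core counts and relative speeds below are made-up numbers, not measurements): with a static split of work and a barrier at each step, every step takes as long as the slowest core.

```python
# Toy model of a barrier/join on big.LITTLE: with a static work split,
# each step takes as long as the slowest core needs for its chunk.
def throughput(chunk, speeds):
    step_time = max(chunk / s for s in speeds)  # barrier waits for the slowest
    total_work = chunk * len(speeds)
    return total_work / step_time

chunk = 100.0
p_only = [3.0] * 8                # hypothetical: 8 fast P cores
p_plus_e = [3.0] * 8 + [1.0] * 2  # add 2 slow E cores

print(throughput(chunk, p_only))    # fast cores only
print(throughput(chunk, p_plus_e))  # adding E cores lowers total throughput
```

In this toy model, adding the slow cores reduces aggregate throughput even though they contribute work, because every barrier now waits on them; work-stealing schedulers avoid this by giving slow cores smaller chunks.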

-d
 
  • Like
Reactions: Appletoni

mr_roboto

macrumors 6502a
Sep 30, 2020
772
1,652
Well, should you have some time to run the benchmark again please use

make -j profile-build ARCH=apple-silicon

I have seen dramatic performance improvements on profile guided builds on x86.
I just tried this on an M1 Max 16". No significant difference in NPS with PGO:

Code:
src % make -j build ARCH=apple-silicon
...
src % ./stockfish bench 1024 10 26 default depth nnue
...
Total time (ms) : 82146
Nodes searched  : 1258329891
Nodes/second    : 15318212

src % make -j profile-build ARCH=apple-silicon
...
src % ./stockfish bench 1024 10 26 default depth nnue
...
Total time (ms) : 89251
Nodes searched  : 1386375040
Nodes/second    : 15533439
 
  • Like
Reactions: Appletoni

Krevnik

macrumors 601
Sep 8, 2003
4,100
1,309
I've seen that kind of problem on other systems where threads take work chunks and then join at the end to stay synchronized across steps. The problem is that in big.LITTLE you now have slower cores, so this strategy can hurt you: the join holds everyone back for the slowest core/thread.

-d

I guess it's interesting that PGO itself can reveal this sort of issue? The fastest results are still P+E with PGO disabled during the build. The P-only PGO results come in third, but with a respectable 7% improvement over the P-only non-PGO build.

I honestly spend too much time on "front of house" and "maintainability" type engineering problems to have gleaned much experience at this level, sadly. Even though it's the low level stuff that got me into the field in the first place.

<snipped for brevity>

Overall, for throughput oriented stream processing style code, I'd expect the Firestorm core to perform similarly — or slightly worse — than a modern AVX2 x86 core, assuming well optimized code for both platforms. On more complex workloads that involve dependent data fetches, non-trivial conditions and high SIMD to data transfer ratio, Firestorm might pull ahead. On less optimized code, M1 should be faster, since it's overall a more flexible architecture — and we can already see it in benchmarks like SPEC that do not employ architecture-specific SIMD optimizations (or maybe where using SIMD would not work in the first place), M1's four independent FP units allow it to execute the code much faster.

This seems to say that the M1 on paper should be closer to trading blows on SIMD code (when core counts are equal in MT situations) than the non-SIMD results would suggest, if I understand you right?
 
  • Like
Reactions: Appletoni

Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
The perf cores of the M1 actually do seem very close to or on par with high-end AMD chips in this workload after the optimization has been done. This makes sense, as the M1 has about 2x the SIMD throughput but 2x smaller SIMD registers. The main differences are that the M1 doesn't have SMT and has relatively few cores overall.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
772
1,652
The perf cores of the M1 actually do seem very close to or on par with high-end AMD chips in this workload after the optimization has been done. This makes sense, as the M1 has about 2x the SIMD throughput but 2x smaller SIMD registers. The main differences are that the M1 doesn't have SMT and has relatively few cores overall.
Do you have any idea why PGO might not be doing much for M1? When I built it, it looked like it was doing PGO things - built a non-PGO executable, ran a benchmark to collect PGO data, re-built presumably with PGO.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
Do you have any idea why PGO might not be doing much for M1? When I built it, it looked like it was doing PGO things - built a non-PGO executable, ran a benchmark to collect PGO data, re-built presumably with PGO.

Two reasons I can think of: LLVM PGO on ARM might be less mature, or PGO might be less effective for a wide microarchitecture such as Firestorm.
 

Thomas Davie

macrumors 6502a
Jan 20, 2004
588
346
Can anyone help me figure out how to play a chess960 game in the Mac version of Deep HCE? Buying and installing the game was easy, but digging through the manual and the program I could not figure this out. Or did I make a mistake in buying the program?

I’m aware that I can install new engines, but with the 2 that came with my download… I can’t figure it out (is it not possible?).

thanks for any help/reading.

Tom
 
  • Like
Reactions: Appletoni

Jorbanead

macrumors 65816
Aug 31, 2018
1,206
1,434
BUT YOU CAN'T MAKE THAT UP WITH OPTIMISATION
literally 1 week later... no doubt after going after just the low hanging fruit...
paging @Leifi
@Leifi still hasn't commented on this, which makes me question whether they were ever interested in an honest conversation or just wanted to keep pushing their narrative.
 
  • Like
Reactions: throAU