
Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
This thread is over my head, but as a chess player who uses Stockfish as the analysis engine in HIARCS Chess Explorer on my M1 MBA, is there something actually useful for me here, or is it all about benchmarks?

HIARCS Chess Explorer uses the HIARCS engine by default but I loaded the Stockfish engine using Homebrew.

I switched from HIARCS engine to Stockfish engine in the belief that Stockfish is a more modern, stronger chess engine, not necessarily faster (no idea about that) but more speed would obviously be good if available.

Thanks
Yes, you will see Stockfish running faster, which translates to being able to analyze deeper in the same amount of time. Engines can think indefinitely, so any speedup improves strength. Stockfish is currently the strongest chess engine.
 
  • Like
Reactions: Appletoni

Mike Boreham

macrumors 68040
Aug 10, 2006
3,767
1,784
UK
Yes, you will see Stockfish running faster, which translates to being able to analyze deeper in the same amount of time. Engines can think indefinitely, so any speedup improves strength. Stockfish is currently the strongest chess engine.
Thanks @Sopel . I had a look at the link in your OP and it wasn't obvious to me how to apply the patch. Will it become available through the usual brew upgrade at some stage?
 
  • Like
Reactions: Appletoni

throAU

macrumors G3
Feb 13, 2012
8,944
7,106
Perth, Western Australia
Optimization work has been ongoing recently, and the result is this PR, which reportedly provides up to an 80% performance increase (one person reports 32%, another 80-85%; the latter person's results are presented below).

I got some numbers from M1 that are comparable with other numbers at http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php. This test is designed to measure mostly the network performance (with hybrid evaluation disabled, only NNUE evaluation is used).

With original Stockfish 14.1 (number of nodes differs between runs because multithreaded search is not deterministic)
Code:
4 threads:
Total time (ms) : 299585
Nodes searched  : 1073196201
Nodes/second    : 3582276

8 threads:
Total time (ms) : 305158
Nodes searched  : 1657744757
Nodes/second    : 5432414

With optimizations cherry-picked on top of Stockfish 14.1
Code:
4 threads:
Total time (ms) : 140385
Nodes searched  : 931041959
Nodes/second    : 6632061

8 threads:
Total time (ms) : 174221
Nodes searched  : 1714777303
Nodes/second    : 9842540

This places the M1 as competitive with the i7-6700K, but still measurably behind the Intel Core i7-11800H and far behind the AMD Ryzen 9 4900H.
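The size of the speedup implied by the quoted nodes/second figures can be checked with a few lines (the numbers are copied verbatim from the benchmark output above):

```python
# Speedup implied by the nodes/second figures quoted above.
before = {4: 3582276, 8: 5432414}  # Stockfish 14.1
after = {4: 6632061, 8: 9842540}   # with the optimizations cherry-picked

for threads in (4, 8):
    gain = after[threads] / before[threads] - 1
    print(f"{threads} threads: +{gain:.0%}")  # roughly +85% and +81%
```

which matches the "80-85%" claim for this particular machine.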

BUT YOU CAN'T MAKE THAT UP WITH OPTIMISATION



literally 1 week later... no doubt after going after just the low-hanging fruit...


paging @Leifi
 

Taz Mangus

macrumors 604
Mar 10, 2011
7,815
3,504
Yes, you will see Stockfish running faster, which translates to being able to analyze deeper in the same amount of time. Engines can think indefinitely, so any speedup improves strength. Stockfish is currently the strongest chess engine.
You kept an open mind and attacked the code. It's great that you stuck with it and that the results bear fruit. It made no sense to make a claim based on false evidence; it's good that this has been corrected now.

Are you saying that the other chess engines run even worse on Apple Silicon M1?
 
Last edited:
  • Like
Reactions: Tagbert

Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
Are you saying that the other chess engines run even worse on Apple Silicon M1?
I'm saying that Stockfish is the best chess engine available regardless of hardware (modulo extreme GPU-to-CPU ratios, where Lc0 is better). By a large margin. But yes, I don't know of any other engine that is as optimized for ARM NEON as Stockfish is. For older engines that didn't use neural networks there was not much that could have been optimized, so they run just fine; they are just bad.
 
  • Like
Reactions: Appletoni

ingambe

macrumors 6502
Mar 22, 2020
320
355
I'm saying that Stockfish is the best chess engine available regardless of hardware (modulo extreme GPU-to-CPU ratios, where Lc0 is better). By a large margin. But yes, I don't know of any other engine that is as optimized for ARM NEON as Stockfish is. For older engines that didn't use neural networks there was not much that could have been optimized, so they run just fine; they are just bad.
What is the performance of Leela chess zero compared to stockfish?

PS: thanks for your patch
 
  • Like
Reactions: Appletoni

Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
What is the performance of Leela chess zero compared to stockfish?

PS: thanks for your patch
Lc0 is not meant to be run on a CPU, and the two engines use vastly different networks and search, so their performance is not comparable in any meaningful way. Strength-wise it really depends on hardware; https://www.sp-cc.de/nn-vs-sf-testing.htm has some results for "i7-8750H 2.6GHz (Hexacore) Notebook, RTX 2060 GPU, Windows 10 64bit, 16GB RAM" hardware. Note that Stockfish 13 is quite a bit behind the current version, possibly around 40-50 Elo with these settings. I'd say that Stockfish on a modern 4-core CPU might be equal to Lc0 on an RTX 3080. Stockfish can also utilize much more hardware than Lc0 can. There's also Ceres https://github.com/dje-dev/Ceres, a rewrite of Lc0 that tries to mitigate some of the scaling issues, but it's still not perfect. Stockfish can scale to tens of thousands of cores (cluster builds), whereas Lc0 has issues with more than 2 high-end GPUs.
 

ingambe

macrumors 6502
Mar 22, 2020
320
355
Lc0 is not meant to be run on a CPU, and the two engines use vastly different networks and search, so their performance is not comparable in any meaningful way. Strength-wise it really depends on hardware; https://www.sp-cc.de/nn-vs-sf-testing.htm has some results for "i7-8750H 2.6GHz (Hexacore) Notebook, RTX 2060 GPU, Windows 10 64bit, 16GB RAM" hardware. Note that Stockfish 13 is quite a bit behind the current version, possibly around 40-50 Elo with these settings. I'd say that Stockfish on a modern 4-core CPU might be equal to Lc0 on an RTX 3080. Stockfish can also utilize much more hardware than Lc0 can. There's also Ceres https://github.com/dje-dev/Ceres, a rewrite of Lc0 that tries to mitigate some of the scaling issues, but it's still not perfect. Stockfish can scale to tens of thousands of cores (cluster builds), whereas Lc0 has issues with more than 2 high-end GPUs.
Thanks for the insight! Very interesting :)
 
  • Like
Reactions: Appletoni

Krevnik

macrumors 601
Sep 8, 2003
4,100
1,309
Looks good! If I may ask, how did you build stockfish? Did you use profile-guided build?

I followed the instructions in the README.md file for the GitHub repo, so it’s using the default build settings. I have enough random side projects of my own to deal with that I don’t have a ton of time to ramp up on someone else’s project at the moment.

Exactly,
That’s why I was very surprised by the performance of the M1 in the original thread
The M1 has a huge cache and very high memory bandwidth; both should help it a lot in an alpha-beta search, where much of the performance can be killed by cache misses.
So either it's a software problem, or the M1 scheduler is particularly bad for this kind of workload.

I guess my musing here is about AVX2/etc vs NEON. While M1 is clearly a strong chip, a lot of benchmarks and real world apps aren’t heavy on SIMD. And I haven’t seen a lot of discussion between the two SIMD implementations to get an idea how close they are in reality. So I’m not sure how much impact things like AVX2 being 256-bit versus NEON being 128-bit play in. How much would having SVE improve things on the ARM side? Would leveraging AMX2 on M1 via the Accelerate framework help at all (it can be ~2x faster than NEON in certain cases)? Does M1 have a similar number of SIMD units per core as modern Intel? Things I just don’t honestly know enough about that would have an impact on the final results here.

Add to that differences in the maturity of the SIMD implementations for ARM and x64 in stockfish (and efforts trying to squeeze out more performance), and any differences there can be magnified quite a bit.
 
Last edited:
  • Like
Reactions: Appletoni

leman

macrumors Core
Oct 14, 2008
19,302
19,284
I followed the instructions in the README.md file for the GitHub repo, so it’s using the default build settings. I have enough random side projects of my own to deal with that I don’t have a ton of time to ramp up on someone else’s project at the moment.

Well, should you have some time to run the benchmark again please use

make -j profile-build ARCH=apple-silicon

I have seen dramatic performance improvements on profile guided builds on x86.
 
  • Like
Reactions: Appletoni

jeanlain

macrumors 68020
Mar 14, 2009
2,440
936
This places the M1 to be competitive with i7-6700k, but still measurably worse than Intel Core i7 11800H and far behind AMD Ryzen 9 4900H.
Not unexpected, given that these have twice the number of high-performance cores.
 
  • Like
Reactions: JMacHack

leman

macrumors Core
Oct 14, 2008
19,302
19,284
I guess my musing here is about AVX2/etc vs NEON. While M1 is clearly a strong chip, a lot of benchmarks and real world apps aren’t heavy on SIMD. And I haven’t seen a lot of discussion between the two SIMD implementations to get an idea how close they are in reality. So I’m not sure how much impact things like AVX2 being 256-bit versus NEON being 128-bit play in. How much would having SVE improve things on the ARM side? Would leveraging AMX2 on M1 via the Accelerate framework help at all (it can be ~2x faster than NEON in certain cases)? Does M1 have a similar number of SIMD units per core as modern Intel? Things I just don’t honestly know enough about that would have an impact on the final results here.

This is not a trivial question. Yes, NEON is 128-bit, but Apple Silicon currently offers 4x SIMD units (for 4x 128-bit ops per cycle), while most AVX2 implementations can do 2x 256-bit ops per cycle, so it's more or less the same. Some Intel CPUs with AVX-512 can do more than one 512-bit operation per cycle, but it is unclear to me under which conditions.
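The "more or less the same" claim can be made concrete with a quick peak-width calculation, using the unit counts quoted above (the post's figures, not measured values):

```python
# Peak SIMD width per cycle, using the unit counts quoted above
# (these are the post's figures, not measured values).
m1_bits = 4 * 128    # 4 NEON pipes x 128-bit
avx2_bits = 2 * 256  # 2 AVX2 pipes x 256-bit
print(m1_bits, avx2_bits)  # both come to 512 bits per cycle
```

So at equal clocks, peak vector throughput per core is identical on paper; the differences come from clocks, loads/stores, and instruction selection, as discussed below.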

To make comparisons more difficult, though, there is also a difference in clocks. M1 will run at ~2.9-3.0 GHz while running heavy SIMD code, while x86 CPUs can clock significantly higher. However, since we are talking about sustained multithreaded throughput, the x86 chip is probably operating close to its base frequency (which is ~3 GHz for modern mobile x86), so we should get comparable throughput.

And then there is the matter of data loads and stores. Modern x86 CPUs can sustain two 256-bit fetches from L1D, while M1 can't. If I understand the available Firestorm info correctly, it will only do two 128-bit loads from L1D per cycle on most data (plus an additional 128-bit load if the load overlaps with a previous store). This might limit M1 performance on workloads where the ratio of operations to memory transfers is low. Stockfish could be one of those cases (it processes large batches of data but does only a few operations per load), so it might be limited by L1D performance, but who knows. At the same time, M1 has much larger caches, so it might have an advantage when processing large amounts of data with less predictable access patterns (which does not seem to be the case for Stockfish specifically).
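To illustrate the "few operations per load" point, here is a toy arithmetic-intensity estimate for a hypothetical int8 affine layer (the 1024x16 dimensions are made up for illustration and are not Stockfish's actual network shape):

```python
# Toy arithmetic-intensity estimate for a hypothetical int8 affine layer.
# Dimensions are illustrative only, not Stockfish's real NNUE topology.
inputs, outputs = 1024, 16
weight_bytes = inputs * outputs  # int8 weights streamed from memory
macs = inputs * outputs          # one multiply-accumulate per weight byte
intensity = macs / weight_bytes  # ops per byte loaded
print(intensity)  # ~1 op per byte: such a loop is load-bound, not ALU-bound
```

At roughly one operation per byte loaded, load bandwidth rather than ALU width becomes the bottleneck, which is why the L1D fetch rate matters here.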

Finally, looking at the ISA itself, it's a bit of a mixed bag. There are some patterns in x86 SIMD that NEON can't do directly, where you need a bunch of instructions to do the same thing. And vice versa.

Overall, for throughput-oriented, stream-processing-style code, I'd expect the Firestorm core to perform similarly to, or slightly worse than, a modern AVX2 x86 core, assuming well-optimized code for both platforms. On more complex workloads that involve dependent data fetches, non-trivial conditions, and a high SIMD-to-data-transfer ratio, Firestorm might pull ahead. On less optimized code, M1 should be faster, since it's overall a more flexible architecture. We can already see this in benchmarks like SPEC that do not employ architecture-specific SIMD optimizations (or where using SIMD would not work in the first place): M1's four independent FP units allow it to execute the code much faster.

It was Apple's decision not to sell 16 to 20 high-performance cores, which was a bad decision.

Please just go away.
 

Krevnik

macrumors 601
Sep 8, 2003
4,100
1,309
Well, should you have some time to run the benchmark again please use

make -j profile-build ARCH=apple-silicon

I have seen dramatic performance improvements on profile guided builds on x86.

Oddly, it goes up by nearly 1M when just using the Perf cores, but down by ~500k when using the E cores as well:

So, it depends on what you are hoping to accomplish.

Code:
./stockfish bench 1024 8 26 default depth nnue
===========================
Total time (ms) : 77095
Nodes searched  : 1159403293
Nodes/second    : 15038631

./stockfish bench 1024 10 26 default depth nnue
===========================
Total time (ms) : 76914
Nodes searched  : 1192240595
Nodes/second    : 15500956
 
  • Like
Reactions: Appletoni

dugbug

macrumors 68000
Aug 23, 2008
1,869
1,953
Somewhere in Florida
Oddly, it goes up by nearly 1M when just using the Perf cores, but down by ~500k when using the E cores as well

I've seen that kind of problem on other systems where threads take work chunks and then join at the end to stay synchronized across steps. The problem is that in big.LITTLE you now have slower cores, so this strategy can hurt you: the join holds everyone back for the slowest core/thread.
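That effect can be sketched with a toy model (the core counts and relative speeds below are made-up numbers, not measurements): with a static split of work and a barrier at each step, every step takes as long as the slowest core.

```python
# Toy model of a barrier/join on big.LITTLE: with a static work split,
# each step takes as long as the slowest core needs for its chunk.
def throughput(chunk, speeds):
    step_time = max(chunk / s for s in speeds)  # barrier waits for the slowest
    total_work = chunk * len(speeds)
    return total_work / step_time

chunk = 100.0
p_only = [3.0] * 8                # hypothetical: 8 fast P cores
p_plus_e = [3.0] * 8 + [1.0] * 2  # add 2 slow E cores

print(throughput(chunk, p_only))    # fast cores only
print(throughput(chunk, p_plus_e))  # adding E cores lowers total throughput
```

In this toy model, adding the slow cores reduces aggregate throughput even though they contribute work, because every barrier now waits on them; work-stealing schedulers avoid this by giving slow cores smaller chunks.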

-d
 
  • Like
Reactions: Appletoni

mr_roboto

macrumors 6502a
Sep 30, 2020
772
1,652
Well, should you have some time to run the benchmark again please use

make -j profile-build ARCH=apple-silicon

I have seen dramatic performance improvements on profile guided builds on x86.
I just tried this on an M1 Max 16". No significant difference in NPS with PGO:

Code:
src % make -j build ARCH=apple-silicon
...
src % ./stockfish bench 1024 10 26 default depth nnue
...
Total time (ms) : 82146
Nodes searched  : 1258329891
Nodes/second    : 15318212

src % make -j profile-build ARCH=apple-silicon
...
src % ./stockfish bench 1024 10 26 default depth nnue
...
Total time (ms) : 89251
Nodes searched  : 1386375040
Nodes/second    : 15533439
 
  • Like
Reactions: Appletoni

Krevnik

macrumors 601
Sep 8, 2003
4,100
1,309
I've seen that kind of problem on other systems where threads take work chunks and then join at the end to stay synchronized across steps. The problem is that in big.LITTLE you now have slower cores, so this strategy can hurt you: the join holds everyone back for the slowest core/thread.

-d

I guess it's interesting that PGO itself can reveal this sort of issue? The fastest results are still P+E with PGO disabled during the build. The P-only PGO results come in third, but with a respectable 7% improvement over the P-only non-PGO build.

I honestly spend too much time on "front of house" and "maintainability" type engineering problems to have gleaned much experience at this level, sadly. Even though it's the low level stuff that got me into the field in the first place.

<snipped for brevity>

Overall, for throughput oriented stream processing style code, I'd expect the Firestorm core to perform similarly — or slightly worse — than a modern AVX2 x86 core, assuming well optimized code for both platforms. On more complex workloads that involve dependent data fetches, non-trivial conditions and high SIMD to data transfer ratio, Firestorm might pull ahead. On less optimized code, M1 should be faster, since it's overall a more flexible architecture — and we can already see it in benchmarks like SPEC that do not employ architecture-specific SIMD optimizations (or maybe where using SIMD would not work in the first place), M1's four independent FP units allow it to execute the code much faster.

This seems to say that the M1 on paper should be closer to trading blows on SIMD code (when core counts are equal in MT situations) than the non-SIMD results would suggest, if I understand you right?
 
  • Like
Reactions: Appletoni

Sopel

macrumors member
Original poster
Nov 30, 2021
41
85
The perf cores of the M1 actually do seem very close to or on par with high-end AMD chips in this workload after the optimization has been done. This makes sense, as the M1 has about 2x the SIMD throughput but 2x smaller SIMD registers. The main differences are that the M1 doesn't have SMT and has relatively few cores overall.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
772
1,652
The perf cores of the M1 actually do seem very close to or on par with high-end AMD chips in this workload after the optimization has been done. This makes sense, as the M1 has about 2x the SIMD throughput but 2x smaller SIMD registers. The main differences are that the M1 doesn't have SMT and has relatively few cores overall.
Do you have any idea why PGO might not be doing much for M1? When I built it, it looked like it was doing PGO things - built a non-PGO executable, ran a benchmark to collect PGO data, re-built presumably with PGO.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
Do you have any idea why PGO might not be doing much for M1? When I built it, it looked like it was doing PGO things - built a non-PGO executable, ran a benchmark to collect PGO data, re-built presumably with PGO.

Two reasons I can think of: LLVM PGO on ARM might be less mature, or PGO might be less effective for a wide microarchitecture such as Firestorm.
 

Thomas Davie

macrumors 6502a
Jan 20, 2004
588
346
Can anyone help me figure out how to play a chess960 game in the Mac version of Deep HCE? Buying and installing the game was easy, but digging through the manual and the program I could not figure this out. Or did I make a mistake in buying the program?

I’m aware that I can install new engines, but with the 2 that came with my download… I can’t figure it out (is it not possible?).

thanks for any help/reading.

Tom
 
  • Like
Reactions: Appletoni

Jorbanead

macrumors 65816
Aug 31, 2018
1,206
1,434
BUT YOU CAN'T MAKE THAT UP WITH OPTIMISATION
literally 1 week later... no doubt after going after just the low hanging fruit...
paging @Leifi
@Leifi still hasn't commented on this, which makes me question whether they were ever interested in an honest conversation or just wanted to keep pushing their narrative.
 
  • Like
Reactions: throAU