
Sopel

macrumors member
Original poster
Nov 30, 2021
Optimization work has been ongoing recently and the result is this PR, which some report provides up to an 80% performance increase (one person reports 32%, another 80-85%; the results from the latter are presented below).

I got some numbers from M1 that are comparable with other numbers at http://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php. This test is designed to measure mostly the network performance (with hybrid evaluation disabled, only NNUE evaluation is used).

With original Stockfish 14.1 (number of nodes differs between runs because multithreaded search is not deterministic)
Code:
4 threads:
Total time (ms) : 299585
Nodes searched  : 1073196201
Nodes/second    : 3582276

8 threads:
Total time (ms) : 305158
Nodes searched  : 1657744757
Nodes/second    : 5432414

With optimizations cherry-picked on top of Stockfish 14.1
Code:
4 threads:
Total time (ms) : 140385
Nodes searched  : 931041959
Nodes/second    : 6632061

8 threads:
Total time (ms) : 174221
Nodes searched  : 1714777303
Nodes/second    : 9842540

This makes the M1 competitive with the i7-6700K, but still measurably slower than the Intel Core i7-11800H and far behind the AMD Ryzen 9 4900H.
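For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic in C. The function name `speedup_percent` is made up for illustration; the inputs are just the nodes/second figures quoted above.

```c
#include <assert.h>

/* Relative gain in nodes/second, as a percentage of the baseline.
 * Hypothetical helper, not part of Stockfish. */
double speedup_percent(double baseline_nps, double optimized_nps) {
    assert(baseline_nps > 0.0);  /* a zero baseline would divide by zero */
    return (optimized_nps / baseline_nps - 1.0) * 100.0;
}
```

With the 4-thread figures (3582276 before, 6632061 after) this gives roughly 85%, and with the 8-thread figures (5432414 before, 9842540 after) roughly 81%.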
 
and they said in other threads there was nothing more to optimize because M1 sucks...

 
The more realistic measurement is ~30-40%. The 80% was probably due to something running in the background for the original 14.1 bench.
 
Thanks, great work!

This makes the M1 competitive with the i7-6700K, but still measurably slower than the Intel Core i7-11800H and far behind the AMD Ryzen 9 4900H.

To make it clear: this is the base M1 we are talking about? Not M1 Pro/Max?
 
Yes, this is the 8 core M1.

Great, thanks! These results look much more in line with what we would expect from Apple Silicon.

It would be nice if someone could bench it on M1 Max, where estimates (extrapolated from this data) are at about 14-15M nps.

My M1 16" won't arrive before January unfortunately, otherwise I would have been happy to run a benchmark.


A quick note: in nnue_feature_transformer.h you are producing 64-bit 8x8 output values per iteration (around line 349):

C:
 out[j+i] = vmax_s8(vqmovn_s16(sums[i]), Zero);

You could potentially optimize this by working on two 16x8 values at once and producing a full 128-bit 8x16 value using this pattern:

C:
 vmaxq_s8(vqmovn_high_s16(vqmovn_s16(a), b), Zero16);

(not 100% sure about the argument order).

This would be virtually the same as the _mm_packs_epi16() in the SSE2 path and should be more efficient.
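
To make the suggested pattern concrete, here is a rough sketch rather than the actual Stockfish code. The function names `narrow_relu_scalar` and `narrow_relu_16` are hypothetical. The NEON path assumes AArch64 (`vqmovn_high_s16` is not available on 32-bit ARM); elsewhere it falls back to a scalar loop that does the same thing lane by lane: saturate each 16-bit value to the int8 range, then clamp negatives to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference for one lane (hypothetical helper). */
static int8_t narrow_relu_scalar(int16_t x) {
    if (x > INT8_MAX) x = INT8_MAX;  /* upper saturation, as vqmovn_s16 does */
    if (x < 0) x = 0;                /* same result as max(narrowed, 0) */
    return (int8_t)x;
}

/* Narrow 16 int16 values to 16 int8 values, clamped to [0, 127]. */
void narrow_relu_16(const int16_t in[16], int8_t out[16]) {
    assert(in != NULL && out != NULL);
#if defined(__aarch64__) && defined(__ARM_NEON)
    const int8x16_t zero16 = vdupq_n_s8(0);
    int16x8_t a = vld1q_s16(in);      /* becomes the low 8 output lanes */
    int16x8_t b = vld1q_s16(in + 8);  /* becomes the high 8 output lanes */
    int8x16_t packed = vqmovn_high_s16(vqmovn_s16(a), b);
    vst1q_s8(out, vmaxq_s8(packed, zero16));
#else
    for (size_t i = 0; i < 16; ++i)
        out[i] = narrow_relu_scalar(in[i]);
#endif
}
```

The point of the 128-bit variant is that one max and one store handle 16 outputs instead of 8; whether that is measurably faster would need benchmarking.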

(Sorry that I am not posting on github directly, I still value my privacy unfortunately :D)

Anyway, really great work! Thanks for taking Mac users seriously. I am very happy that the bickering and tensions in the previous thread did not discourage you.
 
You could potentially optimize this by working on two 16x8 values at once and producing a full 128-bit 8x16 value using this pattern:

C:
 vmaxq_s8(vqmovn_high_s16(vqmovn_s16(a), b), Zero16);
Hmm, this makes sense indeed. Could give ~2-3% more speed. I'll try this later. Thanks.
 
Yes, this is the 8 core M1. It would be nice if someone could bench it on M1 Max, where estimates (extrapolated from this data) are at about 14-15M nps.

Here's some numbers from runs I've done building the latest master:

Code:
./stockfish bench 1024 8 26 default depth nnue (Performance Cores)
===========================
Total time (ms) : 69820
Nodes searched  : 985109927
Nodes/second    : 14109279

./stockfish bench 1024 10 26 default depth nnue (P+E Cores)
===========================
Total time (ms) : 76416
Nodes searched  : 1228833288
Nodes/second    : 16080837

./stockfish bench 1024 8 26 default depth (Performance Cores w/o NNUE)
===========================
Total time (ms) : 75088
Nodes searched  : 1404685908
Nodes/second    : 18707195

./stockfish bench 1024 10 26 default depth (P+E Cores w/o NNUE)
===========================
Total time (ms) : 86774
Nodes searched  : 1737632105
Nodes/second    : 20024801

This seems in line with the estimates assuming your tests were using nnue. I'm curious, is nnue expected to be more computationally expensive at this point in time?
 
I'm curious, is nnue expected to be more computationally expensive at this point in time?
It's much more expensive than the old classical evaluation (which is still used for some positions in the default hybrid mode simply because it is faster). The inference code for the network used in Stockfish has changed a lot over time as the architecture changed and optimizations were made. The current architecture, for example, is about 2-3x more costly than the one first introduced to Stockfish. HERE, for example, are benchmarks done for each commit after NNUE was introduced up to the 25th of July (and it got slower after that) for the hybrid approach. I'd love to have similar data for a longer time span and pure NNUE, but it's expensive to carry out.
 
Sure, I just wanted to confirm that what I was seeing was expected, as I don't pay attention to the chess community. Thanks.
 
Maybe I'm not looking in the right spot, but I can't find any of the M1-optimized libraries. It seems more like someone is writing code for x86, optimizing it, and then seeing how it runs on ARM. My guess is that would yield poor results.
 
Due to the unusual architecture of the Stockfish networks we cannot use any solution that's higher level than assembly. The libraries provided by Apple are not suitable; we need lower-level access.
 
Maybe I'm not looking in the right spot, but I can't find any of the M1-optimized libraries. It seems more like someone is writing code for x86, optimizing it, and then seeing how it runs on ARM. My guess is that would yield poor results.

Well, sure, but this latest patch does improve things tremendously on high-end ARM cores. This is the nature of high-performance software. The suggestion to "just use an optimized library" does not always work. Often, libraries are not flexible enough or simply do not offer what is required.

To make it clear: this patch is not just Apple Silicon specific, it will also improve performance on any newer ARM core such as X2, V1 etc.
 
It reportedly shows a ~5% speed improvement on Raspberry Pi too. Should be good for phones too, and many people do use Stockfish on their phones.
 
I just want to know… what are people doing that the M1 isn’t enough? We’ve reached a point where, without a new reason, I see very few options to max out what we have.
 
It reportedly shows a ~5% speed improvement on Raspberry Pi too. Should be good for phones too, and many people do use Stockfish on their phones.
Forgive me for a rather dumb question, what exactly do you “use” Stockfish for? I thought it was a benchmarking thing?

Edit: OK, so running Stockfish helps improve the overall engine via distributed computing. Kind of like a Folding@home scenario. Pretty neat.
 
No, this is impossible. M1 can’t do chess, and is the slowest processor in the world. At chess and chess-related chess things. Also no optimizations are possible because reasons.
It is impossible until it is not. All the nay saying, without understanding the issue. Core competency.
 
This thread is over my head, but as a chess player who uses Stockfish as the analysis engine in HIARCS Chess Explorer on my M1 MBA, is there something actually useful for me here, or is it all about benchmarks?

HIARCS Chess Explorer uses the HIARCS engine by default but I loaded the Stockfish engine using Homebrew.

I switched from HIARCS engine to Stockfish engine in the belief that Stockfish is a more modern, stronger chess engine, not necessarily faster (no idea about that) but more speed would obviously be good if available.

Thanks
 
Forgive me for a rather dumb question, what exactly do you “use” Stockfish for? I thought it was a benchmarking thing?

Edit: OK, so running Stockfish helps improve the overall engine via distributed computing. Kind of like a Folding@home scenario. Pretty neat.
No, distributed computing has nothing to do with it. It's not a benchmark either, even though it has a benchmark built into it.

Stockfish is a chess engine. There are two main purposes people use it for. One is entering it into computer chess tournaments. Stockfish devs are quite interested in this as it's how they compete against other chess engines and try to improve Stockfish's playing strength. The other main use is as an analysis tool and/or opponent for human players to improve their own play.

The benchmark built into Stockfish is there because chess engines typically (and Stockfish is no exception) work by trying to brute force search through large numbers of possible future moves. They do attempt to prune unpromising branches of the set of possible future moves, so it's not a true brute force search, but it's still something where the computer has to evaluate millions of possible future board states or "nodes" to figure out the best move in the present board state. The key figure in those benchmark results is "nodes/second". Just like humans in serious play, computers are usually given a thinking time limit of some kind, so the more nodes/second that Stockfish can evaluate, the more future board states it gets to look at for each real move, and the stronger its play gets.
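
As a toy illustration of the pruning idea (not Stockfish's actual search, which is far more elaborate), the sketch below compares a plain negamax search with an alpha-beta version on a synthetic game tree and counts visited nodes. Everything here is hypothetical: the tree has a fixed branching factor, and `leaf_score` is a made-up deterministic stand-in for a real evaluation function.

```c
#include <assert.h>
#include <stdint.h>

#define BRANCHING 4  /* every position has exactly 4 "moves" in this toy tree */

/* Stand-in evaluation: a deterministic pseudo-random score in [-100, 100]. */
static int leaf_score(uint32_t id) {
    id ^= id >> 16;
    id *= 0x45d9f3bu;
    id ^= id >> 16;
    return (int)(id % 201u) - 100;
}

/* Exhaustive search: visits every node down to the given depth. */
int negamax(uint32_t id, int depth, long *nodes) {
    assert(depth >= 0);
    ++*nodes;
    if (depth == 0) return leaf_score(id);
    int best = -1000;
    for (uint32_t m = 0; m < BRANCHING; ++m) {
        int score = -negamax(id * BRANCHING + m + 1, depth - 1, nodes);
        if (score > best) best = score;
    }
    return best;
}

/* Same search, but stops exploring a position once it is provably
 * too good for the opponent to allow (alpha >= beta). */
int alphabeta(uint32_t id, int depth, int alpha, int beta, long *nodes) {
    assert(depth >= 0);
    ++*nodes;
    if (depth == 0) return leaf_score(id);
    int best = -1000;
    for (uint32_t m = 0; m < BRANCHING; ++m) {
        int score = -alphabeta(id * BRANCHING + m + 1, depth - 1,
                               -beta, -alpha, nodes);
        if (score > best) best = score;
        if (best > alpha) alpha = best;
        if (alpha >= beta) break;  /* prune the remaining moves */
    }
    return best;
}
```

Searching from node 0 to depth 6 with a window of (-1000, 1000), both functions return the same score. Plain negamax visits all 5461 nodes of the tree, while alpha-beta never visits more than that and usually far fewer. Dividing the node count by the elapsed time gives the "nodes/second" figure.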
 
Exactly. That’s why I was very surprised by the performance of the M1 in the original thread.
The M1 has a huge cache and very high memory bandwidth. These two things should help it a lot in an alpha-beta search, where cache misses can kill much of the performance.
So either it’s a software problem or the M1 scheduler is particularly bad for this kind of problem.
 
So either it’s a software problem or the M1 scheduler is particularly bad for this kind of problem.

So far it seems the problem was that x86 code was heavily optimized for high throughput using SIMD operations, while ARM code was not. @Sopel did a great job improving the situation. I am sure that we will see more improvements (albeit probably smaller ones) down the road as new optimizations are attempted and things are further refined.
 