Apple M1 CPU & GPU speed is very disappointing

Leifi · Nov 29, 2021

leman said:
I never claimed I would be able to make it run much faster. I said that someone would need first to do a code analysis to understand where potential issues are. And sure, I could be that someone but I won’t do it for free. I am not a charity and that’s not a project I have personal interest in.

What makes you think no one has done a code analysis of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Noted. This is the important fact.

pshufd · Nov 29, 2021

Homy said:
When 4A wanted to optimize Metro Exodus and Larian Studios wanted to optimize Baldur's Gate 3 they turned to Apple for help and Apple was more than happy to help them. That has made BG3 the best optimized native game for Apple Silicon doing around 100 fps at Ultra 1080p on MBP 32-core M1 Max. That's faster than an Alienware X17 with i7 11800H and 140W RTX 3070.

That's the power of optimization. So the important question here is: why don't chess software developers ask Apple for help? Or do they? People here really should ask the developers all these questions instead of asking random forum members who don't care for chess (including me) to provide proof. Come back here and tell us what the developers said and what answer Apple gave them before this turns into a 100-page-long discussion about what M1 can't do and why it sucks at Stockfish.

As someone already said this exact discussion about chess and Stockfish came up when M1 was released in 2020. I guess chess hasn't got more popular since then.

Optimize high-end games for Apple GPUs - WWDC21 - Videos - Apple Developer

Optimize your high-end games for Apple GPUs: We'll show you how you can use our rendering and debugging tools to eliminate performance...

developer.apple.com

This is a video on how to run thinkorswim from TD Ameritrade natively on Apple Silicon. Performance generally stinks if you use their installer kit, because you have interpreted Java code that generates x86 code that's then translated by Rosetta 2. Running it on native Apple Silicon Java means that it takes one-third the time for starting up and doing general operations. TD Ameritrade has almost $4 trillion in assets under management and Schwab just bought them out. But they don't have a native Apple Silicon kit for their professional trading platform.

So people who want great performance on Apple Silicon just follow the directions in this video. To earn their living, to make money for their companies or just to manage their 401k accounts. And that's where it is worth it to put some effort into optimization. Because it provides a direct financial payback.

pshufd · Nov 29, 2021

Leifi said:
What makes you think no one has done a code analysis of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Have you looked at it yourself? I've done optimizations that nobody has thought to do in the past because I had an intense interest in making something faster. All it takes is one interested party with the right education, background, skillset and tools.

pshufd · Nov 29, 2021

Jorbanead said:
Well it’s a good thing I didn’t buy my Mac to run chess benchmarks! I guess I’ll just have to deal with extremely fast render times and only being able to run 30 streams of 4K all on battery instead… dang this computer sucks

At least you'll have the best screen and speakers to play chess if you want to.

Taz Mangus · Nov 29, 2021

Leifi said:
What makes you think no one has done a code analysis of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Noted. This is the important fact.

I love learning something new all the time. I now know there is something called Stockfish. If I had not known better I would have thought we were discussing fishing licenses.

pshufd · Nov 29, 2021

Performance optimization encompasses a variety of skills, but you need in-depth knowledge of the architecture and the low-level APIs, and you need to know the best tools for profiling. Apple Silicon is a lot more than ARM. Have you actually optimized a large application before? This is all pretty basic software engineering stuff.

Taz Mangus · Nov 29, 2021

Leifi said:
According to a discussion on talkchess, the Apple M1 even got beaten on battery/performance.

The M1 was able to do 60 G (60 billion) positions of analysis before the battery went from 100% to 0% (the battery lasted 3 hours).

An Asus 5900 laptop ran on battery and finished the same number of positions (60 G) after only 1 hour, with 50% of its battery left.

Laptop     | Positions      | Hours | Battery left
MacBook M1 | 60.000.000.000 | 3 h   | 0%
Asus 5900H | 60.000.000.000 | 1 h   | 50%

Granted, the Asus was probably louder and had a larger battery, but it is still quite sad when the Macs' biggest selling point is power efficiency and yet they are less productive and run out of battery with less work done.

So what you are telling us is that the M1 could go 3 hrs but the Asus 5900H could only go 2 hrs on 100% battery. Got it, good to know. That might be the only valuable thing the chess benchmark is good for on the M1 hardware: a battery life test.
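For anyone who wants to redo the arithmetic behind both readings of the quoted table, here is a minimal Python sketch. The numbers are the ones from the talkchess table; the `BatteryRun` class and its method names are made up for illustration, and the full-charge figures assume battery drain is linear, which is a simplification. It says nothing about energy per position, since the two batteries likely differ in capacity.

```python
from dataclasses import dataclass


@dataclass
class BatteryRun:
    """One battery benchmark run, as reported in the quoted table."""
    name: str
    positions: float     # positions analysed during the run
    hours: float         # wall-clock duration of the run
    battery_used: float  # fraction of a full charge consumed (0..1)

    def positions_per_hour(self) -> float:
        # Raw speed, independent of the battery.
        return self.positions / self.hours

    def hours_on_full_charge(self) -> float:
        # Linear extrapolation to 100% battery (an assumption).
        return self.hours / self.battery_used

    def positions_on_full_charge(self) -> float:
        # Work done per full charge, same linear assumption.
        return self.positions / self.battery_used


m1 = BatteryRun("MacBook M1", 60e9, 3.0, 1.0)
asus = BatteryRun("Asus 5900H", 60e9, 1.0, 0.5)

for run in (m1, asus):
    print(
        f"{run.name}: {run.positions_per_hour() / 1e9:.0f} G positions/hour, "
        f"{run.hours_on_full_charge():.0f} h per charge, "
        f"{run.positions_on_full_charge() / 1e9:.0f} G positions per charge"
    )
```

Under those assumptions the Asus works out to 2 hours and 120 billion positions per full charge, against 3 hours and 60 billion for the M1, so the table supports "longer runtime" for the M1 and "more work per charge" for the Asus at the same time.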

Come back when the chess engine code has been fully optimized to take advantage of the M1 hardware such as the machine learning, the neural engine. Until then, it is pretty much a useless benchmark for the M1 hardware.

mr_roboto · Nov 29, 2021

pshufd said:
Have you looked at it yourself? I've done optimizations that nobody has thought to do in the past because I had an intense interest in making something faster. All it takes is one interested party with the right education, background, skillset and tools.

This is the key. Stockfish is an open source chess engine. If none of the regular contributors have an M1 Mac, they're not going to be able to put much (or any) effort into properly optimizing for M1 (*). If nobody outside that group of people has the right combination of interest, background, and access to tools, it's just not going to happen. Such is the nature of open source.

throAU · Nov 29, 2021

mr_roboto said:
This is the key. Stockfish is an open source chess engine. If none of the regular contributors have an M1 Mac, they're not going to be able to put much (or any) effort into properly optimizing for M1 (*). If nobody outside that group of people has the right combination of interest, background, and access to tools, it's just not going to happen. Such is the nature of open source.

Exactly, and I'd wager that the cross over between

- chess nerd
- M1 owner
- fairly professional coder
- spare time to devote to chess code optimisation

is exceedingly small

The crossover between chess nerd and Linux nerd is probably a lot larger, but even then you'll likely find the app is primarily optimised for Intel on Windows, because that's the dominant platform.

JouniS · Nov 29, 2021

throAU said:
Exactly, and I'd wager that the cross over between

- chess nerd
- M1 owner
- fairly professional coder
- spare time to devote to chess code optimisation

is exceedingly small

The first three groups overlap quite a lot. Professional software developers are far more likely to use a Mac and play chess than the general public. The fourth point is the real problem. Your employer is not going to pay you for contributing to an open source chess engine, so you must do it on your own time.

Then we run into the social dynamics of volunteer projects. People in general do not contribute significantly to somebody else's projects. They prefer contributing to something where they are a part of the core team and have ownership of the project or a part of it. However, most volunteer projects already have exactly as many core members as there is room for. Most would prefer having a larger team, but they can't accommodate any additional people without changing team structure and dynamics, which could cause them to lose existing people.

thenewperson · Nov 29, 2021

robco74 said:
You deride Apple for using proprietery (sic) technology, but then praise CUDA?

I legitimately love how much this happens. Usually in the same sentence too!

leman · Nov 30, 2021

Leifi said:
What makes you think no one has done a code analysis of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Noted. This is the important fact.

Sorry, what exactly do you still want from me? You've been vocally complaining that nobody can offer a more technical explanation why having native ARM and Neon is not the same as proper optimization. In your own words:

Leifi said:
Can you be specific, please? Exactly what optimizations do you think can be done on an M1 that are not already done in Stockfish and the Cfish C builds with NEON? And how much do you think could be gained, in theory and in practice?

This is a question everyone here avoids and just tries to sidestep, because they have no good answer!

Well, I have explained it to you — in depth — in #550. Your reaction? "but Stockfish has had ARM version for years...!" How is one supposed to talk to you on a professional level if all you do is squirm and wriggle like an eel? I know this kind of behavior from my five year old nephew, if you tell him something he doesn't want to hear he closes his ears and goes "la la la", but dealing with this from an (allegedly) adult person? It's embarrassing, really.

ddhhddhh2 · Nov 30, 2021

Leifi said:
I am not sure what your point is. So the M1's performance is irrelevant because more people know more about LeBron than they do about the M1 or chess? And the fact that fashion brands sponsor people who can throw a ball through a hoop makes everything that requires a higher IQ less relevant?

Is that what you are trying to argue?

Yes, the cruel truth.

Of course people care about the performance of the M1, coding, 3D, 2D, video processing, music, even surfing and text work.

But obviously, not chess.

I even wonder how many people in this discussion really understand chess (and I do). How many chess engines are there? Are all chess engines on M1 unoptimized? Do all of them underperform? Let's just say that until you claimed this, almost no one knew.

Let me give you an example. A long, long time ago, when I wanted to do 3D creation, there were a lot of options, but they did not include the most popular one, 3D Studio Max. I later chose C4D + SketchUp, both of which worked very well on Macs, from G4 to G5 to Intel Macs, and continue to work very well on the M series.

If I had insisted on learning 3D Studio Max back then, I would have had to consider a Windows PC instead of macOS.

Although this example is rather extreme, as far as I can see, you find that your beloved chess cannot get the same performance on the M1 as everyone else is experiencing. Of course, you have the right to complain, but I think you have a few options.

1. Sponsor money, lots of money, for chess optimization. And by the way, don't post that stupid pic again. If you're an angel investor, you have the right to talk, unless you don't believe optimizing will help after all.

2. Return the M1. You have spent so much time here preaching your arguments that there was enough time for you to return it.

All your arguments make you look like a child, you are not happy with the teddy bear in your hand, but you are holding on to it.

All software that has ever performed poorly on macOS may or may not improve, but if it keeps performing poorly, people usually just blame the software developer, even if it is open source.

However, you're different from everyone else, and your unique and targeted remarks make me think you have an ulterior motive.

Ah yes, I've seen Intel's advertising budget go down, so you have to do the right thing and let the world know that, even though most people are happy with the performance, power savings and low temperatures that the M series offers, the M series can't run your favorite chess game well, so the M series is garbage. That's your genius logic.

Checkmate!

ddhhddhh2 · Nov 30, 2021

robco74 said:
You deride Apple for using proprietery (sic) technology, but then praise CUDA? You also complain about comparing performance on battery mode, which is perfectly valid for a laptop comparison? How many more escape hatches are there to get through?

M1 has been out for little more than a year. The fact that chess engines haven't yet been fully optimized isn't earth shattering news. Apple, and others, have put resources toward the areas of greatest need. Chess engines aren't exactly at the top of the list.

No, he did not believe that optimization could help, so he preferred to blame the M1 in his hands rather than return it. He argued for so long that he lost the possibility of returning the product, I guess.

I can well understand that. My nephew is not happy with his teddy bear, but if I dare to take it away, he will immediately make a fuss.

Sopel · Nov 30, 2021

Apple users: This workload is bad on M1 because it's for ****ing peasants. Who would even use a chess engine? PATHETIC
Also apple users: People laugh at us because they are envious!

btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon. If you think you can do better the burden of proof is on you.

cmaier · Nov 30, 2021

Sopel said:
Apple users: This workload is bad on M1 because it's for ****ing peasants. Who would even use a chess engine? PATHETIC
Also apple users: People laugh at us because they are envious!

btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon. If you think you can do better the burden of proof is on you.

You manually wrote a compiler option? What?

Appletoni · Nov 30, 2021

Taz Mangus said:
It is called optimization of hardware. So now you are going to complain that Apple has some sort of unfair advantage.

Of course, if you want, you can call Apple's media engines optimization of hardware:
Hardware-accelerated H.264, HEVC, ProRes, and ProRes RAW, Video decode engine, two video encode engines, two ProRes encode and decode engines…
This way it is easy to break every video… benchmark.
Some engines for integer math and other stuff would be great too.

cmaier · Nov 30, 2021

Appletoni said:
Of course, if you want, you can call Apple's media engines optimization of hardware:
Hardware-accelerated H.264, HEVC, ProRes, and ProRes RAW, Video decode engine, two video encode engines, two ProRes encode and decode engines…
This way it is easy to break every video… benchmark.
Some engines for integer math and other stuff would be great too.

They have engines for integer math. They are called CPUs.

Sopel · Nov 30, 2021

cmaier said:
You manually wrote a compiler option? What?

You know what I meant, stop playing dumb trying to derail the argumentation

Appletoni · Nov 30, 2021

pshufd said:
Just curious why you care so much and that explains why.

People don't care about chess.

Look at the net worth of Magnus Carlsen and compare it to LeBron James. Or Tom Brady. Or even Roger Federer. Tennis is not even in the big money leagues but it's still way more than chess. Look at the sponsorships in chess. Do chess players get multimillion dollar clothing contracts? Do they get clothing contracts at all?

If you want chess to run well on Apple Silicon, optimize it yourself.

Funny joke.
It‘s very obvious that you haven’t read the latest newspapers xD

Appletoni · Nov 30, 2021

robco74 said:
You deride Apple for using proprietery (sic) technology, but then praise CUDA? You also complain about comparing performance on battery mode, which is perfectly valid for a laptop comparison? How many more escape hatches are there to get through?

M1 has been out for little more than a year. The fact that chess engines haven't yet been fully optimized isn't earth shattering news. Apple, and others, have put resources toward the areas of greatest need. Chess engines aren't exactly at the top of the list.

If chess engines are slow on the M1 Max, then automatically there is other stuff that is slow too.

leman · Nov 30, 2021

Sopel said:
btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon.

Based on your text I assume you are one of the developers. Great! Finally someone with a little sense here. If I understand it correctly, your SIMD code does computation on neural networks. We know that M1 has plenty of vector units and generally does excellently in SIMD throughput workloads (I am talking NEON here, not AMX or the NPU). That it performs poorly in your code can have two explanations: a) there is indeed something about the nature of your workload that hits a slow path on M1 hardware or b) your code is not optimal. Personally, I think that a) is less likely, since M1 has the bandwidth and the ALUs to excel at most types of matrix code.

So here are a few questions:

- does your code prefetch the data?
- does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
- does your SIMD code rely on long dependency chains that might reduce ILP?
- have you profiled the code on M1 and found that it results in optimal occupancy of vector units?
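To make the second and third questions concrete, here is a toy Python model (not a measurement, and not Stockfish code) of why independent accumulator chains matter. The function `reduction_cycles` and the numbers fed to it (a 3-cycle instruction latency and 4 vector units) are illustrative assumptions, not documented M1 figures. The model also ignores the final step of combining the accumulators.

```python
import math


def reduction_cycles(n_ops: int, chains: int, latency: int, units: int) -> int:
    """Rough cycle estimate for n_ops accumulate operations.

    chains  -- number of independent accumulators the code uses
    latency -- cycles before a result can feed the next dependent op
    units   -- vector units, each able to start one op per cycle
    """
    # Latency bound: ops within one chain must run back to back,
    # each waiting `latency` cycles for the previous result.
    ops_per_chain = math.ceil(n_ops / chains)
    latency_bound = ops_per_chain * latency

    # Throughput bound: the hardware cannot start more than
    # `units` operations per cycle, however many chains exist.
    throughput_bound = math.ceil(n_ops / units)

    return max(latency_bound, throughput_bound)


# Hypothetical parameters, chosen only to show the shape of the effect.
N_OPS, LATENCY, UNITS = 1024, 3, 4

for chains in (1, 2, 4, 8, 12, 16):
    cycles = reduction_cycles(N_OPS, chains, LATENCY, UNITS)
    print(f"{chains:2d} chain(s): {cycles} cycles")
```

With these made-up parameters a single dependency chain needs 3072 cycles for 1024 operations, while 16 independent chains reach the throughput limit of 256 cycles. Whether the real NEON code is latency-bound like this is exactly what a profile on M1 would have to show.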

Sopel said:
If you think you can do better the burden of proof is on you.

This is fair, but unfortunately I personally have little interest in chess engines. Anyway, just because you wrote some NEON code it does not mean that it runs optimally (I have written about it in #550).

Appletoni · Nov 30, 2021

Taz Mangus said:
How about a real world comparison, on a real world usage, of the M1 Max against a RTX3090, AMD Ryzen 9 5950x. I am sorry I could not find an example of a chess program simulation benchmark, so you will just have to settle for this.

You know that he earns money by showing benchmarks of the new M1 Max to Apple fans?
Obviously no one will show bad results to (Apple) fans.
If I sold VW cars, I wouldn’t show bad results to the customers.

Sopel · Nov 30, 2021

leman said:
Based on your text I assume you are one of the developers. Great! Finally someone with a little sense here. If I understand it correctly, your SIMD code does computation on neural networks. We know that M1 has plenty of vector units and generally does excellently in SIMD throughput workloads (I am talking NEON here, not AMX or the NPU). That it performs poorly in your code can have two explanations: a) there is indeed something about the nature of your workload that hits a slow path on M1 hardware or b) your code is not optimal. Personally, I think that a) is less likely, since M1 has the bandwidth and the ALUs to excel at most types of matrix code.

So here are a few questions:

- does your code prefetch the data?
- does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
- does your SIMD code rely on long dependency chains that might reduce ILP?
- have you profiled the code on M1 and found that it results in optimal occupancy of vector units?

This is fair, but unfortunately I personally have little interest in chess engines. Anyway, just because you wrote some NEON code it does not mean that it runs optimally (I have written about it in #550).

> does your code prefetch the data?

There is no need for prefetching as the most costly part of the network is about 16 KiB in size. The largest layer is a lot of sequential accesses that cannot really be prefetched because they are known too late.

> does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?

This is a valid concern, because it's not done explicitly. I cannot do any profiling because I don't own an M1 machine. If you have proof that the vector units are not saturated during inference then I may be able to help. I've stated the same thing for the last few months but apparently no M1 user is capable of providing a profile.

> does your SIMD code rely on long dependency chains that might reduce ILP?

The NEON one, possibly. Again, no one ever provided profiler output so I don't know.

> have you profiled the code on M1 and found that it results in optimal occupancy of vector units?

No, I don't own an M1. No one who has access to an M1 did (which is weird considering this discussion is already more than 20 pages long. Have you guys been throwing empty words for so long?).

leman · Nov 30, 2021

Sopel said:
> does your code prefetch the data?

There is no need for prefetching as the most costly part of the network is about 16 KiB in size. The largest layer is a lot of sequential accesses that cannot really be prefetched because they are known too late.

> does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?

This is a valid concern, because it's not done explicitly. I cannot do any profiling because I don't own an M1 machine. If you have proof that the vector units are not saturated during inference then I may be able to help. I've stated the same thing for the last few months but apparently no M1 user is capable of providing a profile.

> does your SIMD code rely on long dependency chains that might reduce ILP?

The NEON one, possibly. Again, no one ever provided profiler output so I don't know.

> have you profiled the code on M1 and found that it results in optimal occupancy of vector units?

No, I don't own an M1. No one who has access to an M1 did (which is weird considering this discussion is already more than 20 pages long. Have you guys been throwing empty words for so long?).

Ok, once my M1 16" has arrived I will be happy to contribute profiler output. Do you have instructions on how to collect it?
