An Apple GPU core contains 4x 32-wide ALUs, for 128 ALUs or “shader cores/units” in total. How exactly these work, and whether each ALU can execute a different instruction stream, is not clear as far as I know. The Ultra will therefore contain 8,192 “shader cores/units” and should support close to 200,000 threads in flight.
I can definitely imagine performance tanking if not enough threads are supplied. GPUs need to hide latency after all.
Okay, I had to test this. I ran an ALU-heavy kernel (with low register pressure) on my M1 Pro on increasingly large arrays to see where the execution time 'jumped', to try to infer how many threads are running at once on M1 GPUs.
This is what I got:
There's one thread for each array element the GPU kernel is run on. Threadgroup size is set to maximum (1,024 threads per threadgroup on M1 Pro). The execution time jumps every 16,384 array elements / threads. If we focus on the first part of the graph, there's a slope between 0 to 1,024 threads:
Here's what's happening: since threadgroups are 1,024 threads, any dispatch of 1 to 1,024 threads fits in a single threadgroup (and executes on the same GPU core). The execution time in this region makes a small jump every 32 threads, and remains flat otherwise. That must be because the GPU core splits the threadgroup into batches of 32 threads (a single SIMD group) and executes those batches one after the other. For a 16-core GPU, only 32 * 16 = 512 threads are executing 'at once'.
Once the thread count goes over 1,024, another threadgroup is created, which runs on a different GPU core, in parallel with the first threadgroup, so no extra time is needed as long as there are more GPU cores available.
This scales up to 1,024 (threads per threadgroup) * 16 (cores) = 16,384 threads. If there are more threads, they get bundled into a 17th threadgroup, but since there are only 16 cores in my M1 Pro, that extra threadgroup can't start until one of the 16 cores has finished its entire 1,024-thread threadgroup, essentially doubling the time the whole kernel takes to run. Another jump happens after 32,768 threads for the same reason (a 33rd threadgroup is created for 16 cores, so it has to run after two full threadgroups have finished execution).
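The staircase pattern above can be captured with a toy step model. The two timing constants are the rough numbers I measured; the model itself (and names like `kernel_time`) are just my sketch, not anything Apple documents:

```python
import math

# Toy step model of the measured kernel times on a 16-core M1 Pro.
CORES = 16          # GPU cores (M1 Pro, 16-core variant)
TG_SIZE = 1024      # max threads per threadgroup
SIMD_WIDTH = 32     # threads per SIMD group
T_SETUP = 630e-6    # ~630 us: fixed cost per serial 'wave' of threadgroups (estimate)
T_SIMD = 9e-6       # ~9 us: each extra SIMD group a core has to run serially (estimate)

def kernel_time(threads: int) -> float:
    """Predicted wall time for dispatching `threads` threads."""
    tgs = math.ceil(threads / TG_SIZE)
    waves = math.ceil(tgs / CORES)           # serial passes over the 16 cores
    full_tgs = (tgs - 1) // CORES            # complete threadgroups on the busiest core
    tail = threads - (tgs - 1) * TG_SIZE     # threads in the last threadgroup
    simd_serial = full_tgs * (TG_SIZE // SIMD_WIDTH) + math.ceil(tail / SIMD_WIDTH)
    return waves * T_SETUP + simd_serial * T_SIMD

# Flat between SIMD-group boundaries, small step every 32 threads:
print(kernel_time(33) == kernel_time(64))
# Big jump once a 17th threadgroup appears (16 * 1,024 = 16,384 threads):
print(kernel_time(16_384), kernel_time(16_385))
```

It reproduces both jump sizes: ~9 μs every 32 threads within a threadgroup, and a full-wave jump every 16,384 threads.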
On the M1 Ultra, that'd mean only 32 (threads per SIMD group) * 64 (cores) = 2,048 threads are truly executing 'at once'. However, the time an extra SIMD group takes to execute is relatively small (~9 μs in this kernel, the height of the jump between 512 and 512+32 threads) compared to the time it takes to set up a threadgroup (~630 μs for this kernel: the time for just 32 dispatched threads, minus the ~9 μs of SIMD group execution time). It therefore pays off to keep threadgroups as full as possible, so ideally you'd dispatch multiples of 1,024 (max threadgroup size) * 64 (cores) = 65,536 threads 'in flight'.
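To make the trade-off concrete, here's the per-thread cost for a nearly empty vs. a full threadgroup, a back-of-envelope sketch using my measured ~630 μs and ~9 μs figures:

```python
T_SETUP = 630e-6   # ~630 us: per-threadgroup setup cost (measured above)
T_SIMD = 9e-6      # ~9 us: incremental cost per 32-thread SIMD group

def time_per_thread(threads_in_group: int) -> float:
    simd_groups = -(-threads_in_group // 32)   # ceiling division
    return (T_SETUP + simd_groups * T_SIMD) / threads_in_group

# Nearly empty vs. full threadgroup: roughly 20 us/thread vs. under 1 us/thread.
print(time_per_thread(32), time_per_thread(1024))

# M1 Ultra occupancy numbers from the text:
print(32 * 64)     # 2,048 threads truly executing at once
print(1024 * 64)   # 65,536 threads to give every core a full threadgroup
```

Filling a threadgroup amortizes the big fixed setup cost over ~20x more threads, which is why multiples of 65,536 are the sweet spot on the Ultra.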
I think the quoted number of 196,608 threads is 3 times 65,536 because the GPU has compute, vertex, and fragment channels, and (probably) those can run concurrently, for a total of 65,536 * 3 = 196,608 threads in-flight. But for a single kernel you'd only need to fill 65,536 at most. Whether that number is high enough to make the M1 Ultra scale poorly, I don't know. But I don't think it's that high.
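The arithmetic behind that guess (the three-channel split is my speculation, not something Apple documents):

```python
per_kernel = 1024 * 64       # max threadgroup size x M1 Ultra GPU cores
channels = 3                 # compute, vertex, fragment -- my assumption
print(per_kernel)            # threads a single kernel needs at most
print(per_kernel * channels) # matches the quoted in-flight figure
```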