There are three main areas: CPU, GPU, and various accelerators, especially for video. In each of those categories the Ultra has more hardware than the Max.
CPU: Multi-core CPU scalability is a classic problem, often described by Amdahl's Law: even a small fraction of serial (non-parallelizable) code rapidly limits multi-core scaling. You might think an algorithm that is 75% parallel is in good shape; in fact, going from 10 cores to 20 cores then yields only a small improvement (see the sketch after the link). This is not unique to Apple Silicon but applies to any multi-core CPU. So the Ultra (or any high-core-count CPU) requires more work from the programmer to effectively harness the available cores:
https://en.wikipedia.org/wiki/Amdahl's_law
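As a rough illustration, here is a minimal sketch of Amdahl's formula applied to the 75%-parallel example above (the numbers are just that example, nothing measured):

```swift
// Minimal sketch of Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
// where p is the parallel fraction and n is the core count.
func amdahlSpeedup(parallelFraction p: Double, cores n: Double) -> Double {
    return 1.0 / ((1.0 - p) + p / n)
}

// The 75%-parallel example: going from 10 to 20 cores helps little.
let at10 = amdahlSpeedup(parallelFraction: 0.75, cores: 10)  // ~3.08x
let at20 = amdahlSpeedup(parallelFraction: 0.75, cores: 20)  // ~3.48x
print(at10, at20)  // the ceiling is 1 / (1 - 0.75) = 4x, no matter how many cores
```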
GPU: In theory the GPU should be the most scalable part, since the hardware and software were designed from the outset for parallel use. However, as GPU core counts increase, more attention is needed to obtain the best scalability. This was discussed in the WWDC22 talk 10159, "Scale compute workloads across Apple GPUs". Areas that might need attention include serialization between CPU and GPU, improving memory locality, and monitoring newer Xcode performance counters such as MMU TLB Miss Rate (a sketch of the serialization point follows the link):
https://developer.apple.com/videos/play/wwdc2022/10159/
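On the CPU/GPU serialization point, one common pattern is to avoid blocking the CPU on every command buffer. This is only a hedged sketch, not code from the talk; the kernel name "processChunk" and the buffer sizes are placeholders:

```swift
import Metal

// Sketch: keep the GPU fed by using a completion handler instead of
// calling waitUntilCompleted() after each command buffer.
// Kernel name and sizes are hypothetical placeholders.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "processChunk")!)

let buffer = device.makeBuffer(length: 1 << 20, options: .storageModeShared)!

for _ in 0..<8 {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    encoder.dispatchThreadgroups(
        MTLSize(width: 1024, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth,
                                       height: 1, depth: 1))
    encoder.endEncoding()
    commandBuffer.addCompletedHandler { _ in
        // consume this batch's results off the CPU's critical path
    }
    commandBuffer.commit()
}
```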
Video accelerators: The Ultra has double the video engines of the Max: 2 decode engines, 4 encode engines, and 4 ProRes encode/decode engines. For a programmer to harness that full capability in parallel would require significant work, and for some codecs it might not be possible.
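One plausible building block would be a separate hardware-accelerated decode session per stream. This is only a sketch under my own assumptions: whether separate sessions actually land on separate decode engines is up to the OS, and there is no public API to pin a session to a specific engine.

```swift
import VideoToolbox

// Sketch: request a hardware-accelerated decode session for one stream.
// The format description is assumed to come from the real asset.
func makeHardwareDecodeSession(
    for formatDescription: CMVideoFormatDescription
) -> VTDecompressionSession? {
    let decoderSpec = [
        kVTVideoDecoderSpecification_RequireHardwareAcceleratedVideoDecoder: true
    ] as CFDictionary

    var session: VTDecompressionSession?
    let status = VTDecompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        formatDescription: formatDescription,
        decoderSpecification: decoderSpec,
        imageBufferAttributes: nil,
        outputCallback: nil,
        decompressionSessionOut: &session)
    return status == noErr ? session : nil
}
```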
Right now Apple Compressor can automatically segment a ProRes input file into separate fragments and use multiple processes (each theoretically using a separate accelerator), then concatenate the segments for the final output. However, the scalability from this is currently limited: it is not bottlenecked on CPU, GPU, or I/O, so there is some internal limit.
For Long GOP codecs, in theory those using independent or closed GOPs (which don't reference other GOPs) could likewise be segmented and processed in parallel; the accelerator hardware would not know the difference. However, splitting an interframe codec is always risky. It is essentially equivalent to what "no encode" trimming utilities do (a sketch of that kind of trim follows), which historically has been problematic for Long GOP formats.
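For reference, this is the sort of passthrough ("no encode") trim in question; a minimal sketch using AVFoundation's passthrough preset, with hypothetical file paths and time range:

```swift
import AVFoundation

// Sketch: a "no encode" trim of one segment via the passthrough preset.
// For Long GOP sources the cut may only be clean at GOP boundaries.
let asset = AVURLAsset(url: URL(fileURLWithPath: "/tmp/input.mp4"))
let export = AVAssetExportSession(asset: asset,
                                  presetName: AVAssetExportPresetPassthrough)!
export.outputURL = URL(fileURLWithPath: "/tmp/segment1.mp4")
export.outputFileType = .mp4
export.timeRange = CMTimeRange(
    start: CMTime(seconds: 10, preferredTimescale: 600),
    duration: CMTime(seconds: 30, preferredTimescale: 600))
export.exportAsynchronously {
    print("segment status:", export.status.rawValue)
}
```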
Already it appears the multiple decoders on the M1 Ultra are being used for multi-stream input, e.g., multicam. Unlike the single-file case, decoding multiple input files in parallel avoids the Long GOP segmentation risk. FCP is significantly smoother on the M1 Ultra than the M1 Max when reading a four-camera multicam clip composed of 4K H.264. Neither case is limited by CPU, GPU, or I/O, which implies the additional video decoders are being used.
It's a complex area: finding bottlenecks, examining what causes each one, then assessing what is needed to improve it. With the end of Dennard scaling around 2006, each CPU core can no longer keep getting faster simply through higher clock rates:
https://en.wikipedia.org/wiki/Dennard_scaling
That leaves incremental per-core improvements from IPC, but mainly programmers must learn to use increased parallelism and specific hardware accelerators. Software methods that scaled acceptably when going from 4 to 8 CPU cores may require more careful study to go beyond that. Likewise, prior methods that worked well enough on a lower-end GPU may not scale to larger ones, so the programmer must study and optimize this.
Traditionally there has been only a single video accelerator, shared among all CPU cores. E.g., most Intel CPUs have only a single Quick Sync accelerator shared among all cores. That implies a simple programming model in which all threads serialize while waiting on Quick Sync. If, after years of programmers doing it that way, multiple accelerators become available, much more thought is required to use them in parallel without concurrency bugs.
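As a toy illustration of that shift (the submitToEngine function is hypothetical; real accelerator scheduling is handled by the OS, not addressable per-engine from user space), consider moving from one serial queue to one queue per engine:

```swift
import Dispatch

// Hypothetical placeholder for per-engine encode/decode work.
func submitToEngine(_ engine: Int, segment: Int) {
    // ... submit the segment to the given engine ...
}

// Old model: one shared accelerator, so all threads funnel through one
// serial queue and effectively wait in line.
let sharedAccelerator = DispatchQueue(label: "accelerator.single")
for segment in 0..<8 {
    sharedAccelerator.async { submitToEngine(0, segment: segment) }
}

// With multiple engines, work must be partitioned across them, and any
// shared state (progress counters, output ordering) now needs protection.
let engineCount = 4
let engineQueues = (0..<engineCount).map {
    DispatchQueue(label: "accelerator.engine\($0)")
}
for segment in 0..<8 {
    engineQueues[segment % engineCount].async {
        submitToEngine(segment % engineCount, segment: segment)
    }
}
```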