Originally posted by mathiasr
It should be entirely transparent... unless Apple shifts to a NUMA architecture, where apps would have to avoid storing too much data in remote memory, since each CPU would have local (fast) and remote (somewhat slower; remote is just local to another CPU) memory.
In an SMP design, memory always has the same access time (as long as the bus is not saturated).

CC-NUMA is programmer-transparent, too. A NUMA system typically has a scheduler that knows how to migrate pages from node to node and that understands processor affinity, so the program doesn't have to worry about memory allocation at the hardware level at all, except in the most extreme cases.
In other words, if you start a process on processor 12, that process is generally going to stay running on processor 12, even after dropping out of the run queue for things like blocking I/O. That's processor affinity.
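Just to make "affinity" concrete, here's a minimal sketch of what explicit pinning looks like on a present-day Linux box with sched_setaffinity() (my example, not something an app would normally need to do; the CPU number is arbitrary):

[code]
/* Minimal sketch: pinning the calling process to one CPU by hand.
 * On a real NUMA box the scheduler keeps you local on its own; this
 * just shows the mechanism. Linux-specific. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(12, &mask);                        /* run only on CPU 12 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pinned to CPU 12; we'll come back here even after blocking I/O\n");
    return 0;
}
[/code]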
When a process running on processor 12 allocates a block of memory, the system allocates that memory in the closest bank of RAM, topologically speaking. If that's not on the local node, then the system will page-migrate the memory to the local node as local RAM becomes available. This happens completely invisibly from the perspective of the application.
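And here's the allocation side of the same idea, sketched with Linux's libnuma (again my choice of API, nothing Apple-specific): the call below asks for memory on whatever node the calling thread is running on, which is exactly what the OS does on your behalf anyway.

[code]
/* Sketch: asking for node-local memory by hand with libnuma.
 * Build with: cc numa_local.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    size_t len = 64UL * 1024 * 1024;          /* 64 MB scratch buffer */
    char *buf = numa_alloc_local(len);        /* placed on the local node */
    if (buf == NULL)
        return 1;

    buf[0] = 1;                               /* touch it so the pages are really backed */
    printf("running on node %d\n", numa_node_of_cpu(sched_getcpu()));

    numa_free(buf, len);
    return 0;
}
[/code]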
Now, NUMA systems are designed in multi-dimensional cube (hypercube) topologies, which means that even in a thousand-processor system a processor is never more than seven router hops away from the furthest bank of memory. In a 1,024-processor SGI Origin 3000 system like NASA's Chapman, for example, the best-case memory latency is 170 ns, the average is 480 ns, and the worst case is only 640 ns. So the spread from best case to worst case is only a factor of about 3.8 (640 / 170), and that's across a thousand processors with standard memory modules. And remember, this is CPU-to-main-memory; in the real world, effective latencies are much lower thanks to fat caches and predictive fetching. In sequential-read applications like video processing (read a byte, write a byte; or read a vector, write a vector), the cache line the processor reaches for next has almost always been loaded into cache predictively, so sequential applications are very cache-friendly.
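The cache-friendliness point is easy to see in code. In this toy sketch (the 64-byte cache line is an assumption), the unit-stride loop lets the hardware prefetcher run ahead of the CPU, while the strided loop lands on a fresh cache line at nearly every access even though it does the same arithmetic:

[code]
/* Toy illustration of sequential vs. strided access patterns. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N (1 << 24)                 /* 16M ints = 64 MB, bigger than any cache */

/* Unit stride: the prefetcher can stay ahead of the loop. */
static long sum_sequential(const int *a)
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Same sum, but each access jumps 64 bytes, so nearly every load
 * touches a cache line that isn't resident yet. */
static long sum_strided(const int *a)
{
    long s = 0;
    for (size_t j = 0; j < 16; j++)
        for (size_t i = j; i < N; i += 16)
            s += a[i];
    return s;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a)
        return 1;
    memset(a, 1, (size_t)N * sizeof *a);      /* arbitrary contents */

    printf("%ld %ld\n", sum_sequential(a), sum_strided(a));
    free(a);
    return 0;
}
[/code]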
For comparison, the round-trip time to main memory in a Power Mac G4 is about 95 ns. Throw a NUMA memory controller in there, and you would see memory latencies pretty much in line with the fastest computers in the world, give or take a few percent. In other words, you would NOT have to worry about local versus remote memory.
Now, in HPC/TC applications, programmers can directly manipulate the memory topology of a NUMA system to suit their needs. They can allocate given blocks of memory on given nodes if they so choose. That kind of application tuning is saved for the sort of long-running jobs where tweaking the memory allocations might save you a week or two of computer time over the course of the run. You won't see that kind of optimization on anything less than a supercomputer for a long, long time, because it just won't be worth it. Hand-coding the memory handling in a NUMA version of After Effects might save you three minutes over the course of a year. Hardly worth the effort.
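For a picture of what that hand placement looks like, here's a hedged sketch using Linux's libnuma (the interfaces an Origin programmer would have used were different, and the node numbers and sizes here are made up): the programmer, not the OS, decides which node backs which block.

[code]
/* Sketch: explicit per-node placement of two partitions of a working set.
 * Build with: cc place.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support here\n");
        return 1;
    }

    size_t len = 256UL * 1024 * 1024;         /* 256 MB per partition */
    int far_node = numa_max_node();           /* highest-numbered node present */

    /* Put one partition on node 0 and the other on the far node, so the
     * threads pinned near each node read mostly local memory. */
    double *part_a = numa_alloc_onnode(len, 0);
    double *part_b = numa_alloc_onnode(len, far_node);
    if (!part_a || !part_b) {
        fprintf(stderr, "placement failed\n");
        return 1;
    }

    /* ... a long-running solver would work on part_a / part_b here ... */

    numa_free(part_a, len);
    numa_free(part_b, len);
    return 0;
}
[/code]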
Now, none of this means anything, except this: if Apple were to decide to implement small-scale NUMA (< 16 processors) using chips that are instruction-set compatible with the PowerPC G4 family, it would not be necessary for application vendors to go back and change their code to run on the new systems.