As usual, where are the links? 🙄
If the CPU has "chiplets" and Infinity Fabric - how can it not be NUMA?
Technically, you can make all access "far" for every one of the chiplets; then it is uniformly slower.
🙂
For a subset of the problem this is the path that AMD took.
https://www.anandtech.com/show/1356...n-approach-7nm-zen-2-cores-meets-14-nm-io-die
All the memory is hung off the I/O chip in the middle. (This is drifting slightly back to the "front side bus" days of yesteryear ... and the same bandwidth problems.) The chiplets all come in through the 'Infinity' links (where the infinity markers are on the I/O chip in the middle). There is either a large crossbar switch or a token ring to multiplex the access of all the cores to the DRAM segments of the I/O chip. Passing through that is probably uniform, depending on the internals of the I/O chip. If the I/O chip has a 'top' and 'bottom' half, then there may be some non-uniformity there.
What that image doesn't really cover, though, is cache swaps. If the latest data is in the L1/L2/L3 cache of chiplet A and chiplet B now needs it, then the swap will have to traverse the I/O chip. If the data is in an L2 cache of another core in the same chiplet, then it will be a 'local' transfer. Hence, you still have NUMA if you look at the whole memory stack instead of just the RAM DIMMs.
Finally, there is a presumption here that, inside the AMD chiplet, access to the Infinity link to the I/O chip is uniform. There is a pretty good chance that it isn't uniformly a "1 hop" jump to the I/O chip from every single core. So you still have NUMA.
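To make the DRAM-versus-cache distinction concrete, here is a toy Python sketch. It assumes a hub layout where every chiplet reaches DRAM through the I/O die; the cost constants and function names are invented for illustration and are not measured Zen 2 latencies.

```python
# Toy model only: the cost units below are made-up assumptions,
# not measured latencies for any real CPU.
LOCAL_HOP = 1    # hypothetical cost of a transfer inside one chiplet
FABRIC_HOP = 4   # hypothetical cost of one chiplet <-> I/O die link


def dram_cost(chiplet: int) -> int:
    """Every chiplet reaches DRAM through the I/O die: one fabric hop."""
    return FABRIC_HOP


def cache_transfer_cost(src_chiplet: int, dst_chiplet: int) -> int:
    """Cost of pulling a cache line held on src_chiplet into dst_chiplet."""
    if src_chiplet == dst_chiplet:
        return LOCAL_HOP
    # chiplet A -> I/O die -> chiplet B
    return 2 * FABRIC_HOP


if __name__ == "__main__":
    print("DRAM cost per chiplet:", {c: dram_cost(c) for c in range(4)})
    print("same-chiplet cache transfer:", cache_transfer_cost(0, 0))
    print("cross-chiplet cache transfer:", cache_transfer_cost(0, 3))
```

In this sketch the DRAM cost is identical for every chiplet, yet the cache-to-cache cost still depends on where the data lives, which is the non-uniformity described above.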
Xeon E5 had NUMA with just a single CPU package. NUMA isn't necessarily precipitated by sockets.
https://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-review/2
The 12 cores on the left have even access to the two DDR channel/memory controllers that are also on the left (all sharing the same token ring network). However, to get to the other two, it is a non-uniform hop to the other token ring and the other two memory channels.
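A rough Python sketch of that two-ring picture follows. The stop count per ring, the single bridge stop, and the bridge cost are all assumptions chosen for illustration; the real die layout is more involved.

```python
# Toy model of two bidirectional rings joined by a bridge.
# All constants are illustrative assumptions, not real hardware values.
RING_STOPS = 12   # assumed number of stops on each ring
BRIDGE_COST = 2   # hypothetical extra cost to cross between the rings
BRIDGE_STOP = 0   # assumed: both rings attach to the bridge at stop 0


def ring_distance(a: int, b: int, stops: int = RING_STOPS) -> int:
    """Shortest hop count between two stops on a bidirectional ring."""
    d = abs(a - b) % stops
    return min(d, stops - d)


def access_cost(core_ring: int, core_stop: int, mc_ring: int, mc_stop: int) -> int:
    """Hops from a core to a memory controller, crossing the bridge if needed."""
    if core_ring == mc_ring:
        return ring_distance(core_stop, mc_stop)
    return (ring_distance(core_stop, BRIDGE_STOP)
            + BRIDGE_COST
            + ring_distance(BRIDGE_STOP, mc_stop))


if __name__ == "__main__":
    print("same ring:", access_cost(0, 3, 0, 6))
    print("other ring:", access_cost(0, 3, 1, 6))
```

The same core pays more to reach a controller on the other ring than one on its own ring, even though everything sits in one package.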
The Xeon SP (and Xeon W and Core i9) has a mesh instead of a token ring, with even larger non-uniformity.
https://www.anandtech.com/show/1155...-core-i9-7900x-i7-7820x-and-i7-7800x-tested/5
So there, the core in the lower right has to jump through several of those yellow links to get to the memory controller on the left, versus the other three cores adjacent to the memory controller on the left.
The upside of the mesh is that if you get to a point where only 1-3 cores are active, you can shrink down to a subset that is not the farthest away from any one memory controller (e.g., avoiding the four far corners). It is still NUMA, but you control the scope and impact.
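The "shrink down to the closer cores" idea can be sketched in Python as below. The mesh size, the memory controller positions, and the use of Manhattan distance as the hop count (which assumes simple X-then-Y routing) are all assumptions for illustration, not the actual Skylake layout or scheduler behaviour.

```python
from itertools import product

# Toy mesh: sizes and controller positions are illustrative assumptions.
ROWS, COLS = 4, 5
MEM_CONTROLLERS = [(1, 0), (1, 4)]   # hypothetical: one on each side


def hops(a, b):
    """Manhattan distance between two mesh stops (assumes X-then-Y routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def worst_case_hops(core):
    """Hop count from a core to the memory controller farthest from it."""
    return max(hops(core, mc) for mc in MEM_CONTROLLERS)


def pick_active_cores(k):
    """Choose the k cores whose worst-case distance to memory is smallest."""
    cores = [c for c in product(range(ROWS), range(COLS))
             if c not in MEM_CONTROLLERS]
    return sorted(cores, key=lambda c: (worst_case_hops(c), c))[:k]


if __name__ == "__main__":
    print("corner core worst case:", worst_case_hops((3, 4)))
    print("three cores to keep active:", pick_active_cores(3))
```

With these toy numbers, the corner cores have the largest worst-case hop count and are never chosen when only a few cores need to stay active.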
...Just an aside, Threadripper probably isn't a good fit due to interdie latency. EPYC would be better as each die has direct access to RAM, but then you still have to deal with NUMA latency issues. Apple hasn't had to deal with that since the old Mac Pro days of multi socket systems. So I am not sure if they have optimized for UMA.
NUMA isn't just about sockets. The 12 core Xeon E5 v2 has NUMA.
https://www.anandtech.com/show/7852/intel-xeon-e52697-v2-and-xeon-e52687w-v2-review-12-and-8-cores
Depending upon how much L3 synchronization traffic gets kicked up, you may not get uniform access to memory. The 12 core has token rings stretched a bit thinner than the lower core count variants. NUMA was still around in the MP 2013, but Apple could afford to mostly ignore it with a very limited impact.