The architecture will allow for multiple DIMMs per channel (interleaving), but Apple didn't balance this out (2x DIMM slots per channel, for example),
What? Where is the technical backup for this?
In a 4 DIMM slot setup, one of the memory channels is in interleaved mode. The other two are not (or at least don't have to be, with proper support). Moving just one of the channels to interleaved mode minimizes the impact (versus moving all three).
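To make the layout concrete, here's a toy model of the 4-slot case: one 1GB DIMM each on channels 0 and 1, and two 1GB DIMMs interleaved with each other on channel 2. The mapping function and the 256-byte interleave granularity are my own invention for illustration; the real controller's hashing isn't public.

```c
/* Toy model of a 4-DIMM, 3-channel layout: channels 0 and 1 get one
 * 1GB DIMM each, channel 2 gets two 1GB DIMMs interleaved with each
 * other.  Granularity and mapping are invented for illustration only. */
#include <stdio.h>
#include <stdint.h>

#define GB   (1ULL << 30)
#define LINE 256ULL   /* assumed interleave granularity (made up) */

/* Map a physical address in [0, 4GB) to (channel, dimm). */
static void locate(uint64_t paddr, int *channel, int *dimm)
{
    if (paddr < 1 * GB)      { *channel = 0; *dimm = 0; }
    else if (paddr < 2 * GB) { *channel = 1; *dimm = 1; }
    else {
        /* Channel 2: two DIMMs, alternating every LINE bytes. */
        *channel = 2;
        *dimm = 2 + (int)((paddr / LINE) & 1);
    }
}

int main(void)
{
    uint64_t samples[] = { 0, 1*GB + 5, 2*GB, 2*GB + LINE, 3*GB + 7 };
    for (int i = 0; i < 5; i++) {
        int ch, d;
        locate(samples[i], &ch, &d);
        printf("paddr 0x%010llx -> channel %d, DIMM %d\n",
               (unsigned long long)samples[i], ch, d);
    }
    return 0;
}
```

Note that only addresses above 2GB ever touch two DIMMs in this model; the first two channels behave exactly as they would in a 3-DIMM setup.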
You still have triple channels. You only "lose" the triple-channel effect when all the cores happen to pull data out of the same channel (e.g., all four cores work on a single 500 MB image which is allocated purely to just one DIMM, or, in a 4-DIMM setup, purely across the two DIMMs interleaved on the 3rd channel).
There is no change in the software. The software just says "get me the memory at address 1234000"; it is the job of the CPU's MMU and memory subsystem to go get it. Which exact physical DIMM it is on is opaque to the software. The app doesn't even use physical memory addresses.
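You can see that opacity from userland: everything an app ever holds is a virtual address, and no portable API tells it which channel or DIMM backs a page. A minimal illustration (the printed addresses are virtual; the kernel and MMU own the virtual-to-physical mapping):

```c
/* An app only ever sees virtual addresses; which physical DIMM backs
 * them is decided by the kernel's page allocator and the memory
 * controller, and is invisible here. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 64 * 1024 * 1024;   /* 64MB buffer */
    char *buf = malloc(n);
    if (!buf) return 1;

    buf[0] = 1;                    /* fault in the first page */
    buf[n - 1] = 1;                /* ...and the last one     */

    /* These are virtual addresses.  Consecutive virtual pages may sit
     * on different physical DIMMs; nothing here can tell which. */
    printf("buffer spans virtual %p .. %p\n",
           (void *)buf, (void *)(buf + n - 1));
    free(buf);
    return 0;
}
```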
Apple supports 4 DIMM slots because it is the natural way to get to 8GB using 2GB DIMMs (and 16GB in the dual-package setup). Most customers are currently going to want to avoid 4GB DIMMs because they are not quite as cost effective.
It isn't theory.
If you put in one 4GB DIMM versus four 1GB DIMMs, you'd see a difference on anything that was actually dependent upon memory transfers and large enough to force a spread across DIMMs (versus microbenchmarks that load everything into the L3 cache, or disk I/O marks that stall the cores so much that the number of no-ops is sky high).
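The kind of test that would show it is a bandwidth-bound loop over a working set far bigger than the L3, run on all cores at once. A rough sketch (a STREAM-style triad using OpenMP; the array size is arbitrary, chosen only to blow well past any cache):

```c
/* STREAM-style triad: bandwidth-bound, working set >> L3 cache, so the
 * DIMM/channel configuration actually matters.  Build with:
 *   cc -O2 -fopenmp triad.c -o triad
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (64 * 1024 * 1024)   /* 3 arrays x 8 bytes x 64M = 1.5GB */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* 2 reads + 1 write per element */
    double t1 = omp_get_wtime();

    /* 3 * 8 bytes moved per element */
    printf("triad: %.2f GB/s\n", 3.0 * 8.0 * N / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```

Run it on both configurations and the number printed is the memory subsystem, not the cores, talking.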
In the first case you essentially have a front side bus. You know... the architecture that AMD abandoned a long time ago and that Intel has also abandoned, with much fanfare, with Nehalem. Before tossing that difference off as purely theoretical occasional benefits... why have they both changed? Because it makes no difference? Numerous benchmarks state otherwise.
But exceptionally little software can actually utilize it,
The triple channels are implemented in hardware. The software has no choice but to utilize them. The OS memory page allocation code may not be completely milking all of the advantage out of it, but you don't need to change application software to get the effect.
A very straightforward case is when an OS has a file buffer cache and does copies into the app's address space when data is read. One core can be doing look-ahead reads, filling up the cache, while another core copies into the app's address space. While both will need to serialize when accessing the same buffer cache entry, when writing/reading asynchronously they can both proceed in parallel, unblocked. If you have a single front side bus, all those accesses to memory must be serialized and you get more slowdowns.
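Roughly something like this: two threads driving independent memory streams, the way read-ahead and copy-out can overlap. This is my own stand-in with pthreads, not actual kernel code; on a multi-channel machine the two streams can be serviced by different channels, while on a shared FSB they'd contend for one bus.

```c
/* Two independent memory streams in parallel: one thread "fills the
 * buffer cache" (streaming copy in), the other "copies out to the app"
 * (streaming copy from a disjoint region).  Stand-in for the kernel
 * read-ahead / copy-out overlap, not real kernel code.
 * Build with: cc -O2 -pthread streams.c -o streams
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SZ (256 * 1024 * 1024)   /* 256MB per region, well past any L3 */

static char *cache, *filedata, *appbuf;

static void *reader(void *arg)   /* read-ahead: fill the "cache" */
{
    (void)arg;
    memcpy(cache, filedata, SZ);
    return NULL;
}

static void *copier(void *arg)   /* copy-out of an already-filled region */
{
    (void)arg;
    memcpy(appbuf, filedata + SZ, SZ);
    return NULL;
}

int main(void)
{
    filedata = malloc(2 * (size_t)SZ);
    cache    = malloc(SZ);
    appbuf   = malloc(SZ);
    if (!filedata || !cache || !appbuf) return 1;
    memset(filedata, 'x', 2 * (size_t)SZ);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, copier, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("both streams done: %c %c\n", cache[0], appbuf[0]);
    return 0;
}
```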
Another case: after the OS has allocated/deallocated physical memory pages for the creation/destruction of several applications, the physical page allocations start to distribute somewhat randomly across DIMM boundaries (e.g., an app gets its last 20 pages on one DIMM and the rest on another; even if the app thinks a data structure is localized, access to it can still be parallelized).
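A toy simulation of that churn (hypothetical page allocator with two pretend DIMMs; everything here is invented to show the dispersal, not how any real kernel allocates):

```c
/* Toy free list: alloc/free churn scrambles the order pages come back
 * in, so a new app's pages land on both DIMMs.  Entirely hypothetical;
 * it just illustrates the scattering effect. */
#include <stdio.h>
#include <stdlib.h>

#define PAGES    512
#define PER_DIMM (PAGES / 2)   /* pages 0..255 "on" DIMM 0, rest on DIMM 1 */

int main(void)
{
    int freelist[PAGES];
    srand(42);

    /* Start with all pages free, in order. */
    for (int i = 0; i < PAGES; i++) freelist[i] = i;

    /* Churn: shuffle the free list, the way alloc/free cycles from
     * many app launches scramble the order pages are handed out. */
    for (int i = PAGES - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = freelist[i]; freelist[i] = freelist[j]; freelist[j] = t;
    }

    /* A new app asks for 16 pages; it gets whatever comes off next. */
    int on_dimm[2] = {0, 0};
    for (int k = 0; k < 16; k++)
        on_dimm[freelist[k] / PER_DIMM]++;

    printf("new app's 16 pages: %d on DIMM 0, %d on DIMM 1\n",
           on_dimm[0], on_dimm[1]);
    return 0;
}
```

After enough churn the split comes out close to even, which is exactly what lets accesses to a "localized" structure run down separate channels.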
Pragmatically, since the kernel code and data structures start up first, they tend to segregate onto a different DIMM than the application code. Similarly, if you are running multiple apps which are periodically accessing memory and doing things in the background, they too, on average, tend to disperse off to different DIMMs if you have a low-GB, single-DIMM-per-channel setup.
The effect tends to disappear as you put very large memory pools on each of a limited number of channels. Folks who jam as much memory as possible into the box and then invoke narrowly focused concurrency loads on relatively small problems (much smaller than the per-channel memory pool size) will see much smaller effects.