I was dealing specifically with Symmetric Multiprocessing (SMP) and memory management, while you seem to be focusing on the physical layers.
We are talking about dual/tri-channel memory which is an element of the physical implementation of RAM.
There's a really simple fact that demonstrates there's a software influence: if bandwidth usage were purely a function of the memory controller in the Nehalem parts, then all applications would be able to utilize triple-channel operation. But that's not the case at all. Some applications can; most can't.
Your observation that tri-channel appears to have little impact is correct... however, you are in error as to the root cause. Rest assured that with 3 identical DIMMs, the memory controller is interleaving across all three channels for all memory operations... all the time. All applications are using triple channel; you just don't see any difference between tri-channel and dual-channel in real-world apps because the CPU cache is large enough to fully contain the working set for most apps, and with a good prefetch engine, as most modern CPUs have, a cache miss is a rarity.
As I continue to say, the concept is identical to striping across multiple disks in a RAID0 array. Clearly things scale as you add more disks... an array of 3 disks has better performance than an array of 2 disks. And of course, data is striped across all 3 disks all the time; it's not a function of software. But the difference between a 2-drive and a 3-drive array is subtle for most apps, and it takes specialized benchmarks to reveal the true difference. Now, if you introduce a RAID controller with a huge cache, the differences between a 2-disk and a 3-disk array get even more difficult to measure for most applications. If the cache is big enough, it will make the RAID array seem to perform amazingly, regardless of whether the array has 2 or 3 drives.
The same effect is going on with CPU cache... it's so large that it makes dual-channel DDR2 look impressive, and pumping the memory bandwidth up with tri-channel DDR3 has little impact because the cache already contains everything the core logic needs to keep operating at peak efficiency. Only specialized benchmarks that bypass cache can measure the true bandwidth difference. The real-world effect of going from dual to tri-channel with such huge CPU caches is almost negligible.
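To make the working-set point concrete, here is a minimal sketch of the kind of test that exposes it. The buffer sizes and the use of memcpy are my own illustrative choices, not any particular benchmark's methodology: the small run stays inside cache and tells you nothing about channel count, while the large run spills out to DRAM, which is the only place dual- vs tri-channel can actually show up.

/* Sketch of the working-set effect described above. The sizes are
 * hypothetical: 32 KiB fits comfortably in cache, 256 MiB does not,
 * so only the second measurement is limited by DRAM bandwidth (and
 * therefore by the number of memory channels). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double copy_bandwidth(size_t bytes, int reps)
{
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) { perror("malloc"); exit(1); }
    memset(src, 1, bytes);                       /* touch the pages once */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, bytes);                 /* stream reads + writes */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb  = (double)bytes * reps * 2 / 1e9; /* read + write traffic */
    free(src);
    free(dst);
    return gb / sec;
}

int main(void)
{
    /* Small working set: served from cache, channel count is irrelevant. */
    printf("32 KiB  : %.1f GB/s\n", copy_bandwidth(32u * 1024, 100000));
    /* Large working set: spills to DRAM, channel count finally matters. */
    printf("256 MiB : %.1f GB/s\n", copy_bandwidth(256u * 1024 * 1024, 10));
    return 0;
}

Run it on a dual-channel and a tri-channel box and the first number will barely move, while the second is where any extra channel bandwidth would appear.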
It has to do with both the compiler (not just architecture optimization, but the libraries/functions available that assist in multi-channel memory optimization) and the application code itself (which must be written to use all available channels if that support doesn't exist in the compiler). Worse yet, a poorly written OS could hamper bandwidth utilization as well (assuming the application code doesn't try to bypass the OS's memory management). Unoptimized code is more likely to use a single channel's bandwidth than the parallelism that would be possible (and OSes aren't capable of forcing it either).
Unfortunately, it's more common to find a multi-threaded application that only uses one memory channel than one that can utilize n cores and n memory channels simultaneously (or a single-threaded application that can use n memory channels).
Not at all true. The number of channels is a physical-layer element that is abstracted away from higher-level functions. Software and the OS are not aware of, and don't care, how many physical channels the memory controller is using to interleave operations; this is handled at a much lower level. Again, it's the same as the number of drives in a RAID0 array: the RAID controller knows how many drives there are, but to the OS and apps, it's just a file system.
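If it helps, here's a rough sketch of what that lower-level mapping looks like. The cache-line size, channel count, and simple round-robin selection are assumptions for illustration, not Nehalem's actual scheme; the point is only that the channel is derived from the physical address inside the controller, with nothing for the OS or application to opt into, exactly like a RAID0 controller deriving the disk from the stripe number.

/* Illustrative channel interleaving: which channel a cache line lands on
 * is a pure function of the physical address, chosen by the memory
 * controller. Line size, channel count, and the round-robin rule are
 * assumptions, not any specific controller's mapping. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE   64u   /* bytes per cache line (assumed) */
#define NUM_CHANNELS 3u    /* tri-channel configuration */

static unsigned channel_for(uint64_t phys_addr)
{
    uint64_t line = phys_addr / CACHE_LINE;    /* which cache line */
    return (unsigned)(line % NUM_CHANNELS);    /* rotate across channels */
}

int main(void)
{
    /* Consecutive cache lines rotate across all three channels. */
    for (uint64_t addr = 0; addr < 8 * CACHE_LINE; addr += CACHE_LINE)
        printf("phys 0x%03llx -> channel %u\n",
               (unsigned long long)addr, channel_for(addr));
    return 0;
}

No compiler flag, library call, or OS policy appears anywhere in that mapping, which is the whole point: there is nothing for software to "use" or fail to use.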
Ultimately, not all software has been written to utilize the bandwidth available across multiple memory channels (simultaneous channel operation, essentially SMP applied to the memory controller itself). Until that changes and software catches up, the full benefit won't be seen by most users, save for a few specialized applications.
Software does not need to be optimized to take advantage of the read/write performance of a 3-drive RAID0 array, nor does it need to be for tri-channel memory. These are hardware-level details that are abstracted away from higher-level functions.
BTW, the benchmarks and apps that make tri-channel DDR3 look good are not optimized in any way. They simply bypass cache (the benchmarks) or have working sets that exceed the cache size (the apps) and therefore more readily reveal the underlying memory bandwidth bottleneck. Most apps are incapable of revealing this, as their working sets execute entirely from within cache.
Also, if you are interested, the history of this problem goes back to NetBurst, where Intel needed a way to overcome the limitations of the FSB. They started dedicating more silicon to cache to improve memory performance, and with the Core 2 architecture settled on a standard of 2MB of cache per core... twice as much as AMD's architecture required, since AMD had an IMC. Interestingly enough, Intel continues to use more cache than necessary, with 2MB of L3 cache per core, even now that they have an IMC. Most computer scientists are puzzled by this decision, because Intel could now reallocate some of that silicon in favour of improving the performance of the core logic without starving the cores for data. However, they've yet to reverse course on their excessive cache sizes.