Oh dear. This is probably becoming way off-topic and useless to readers of this forum. I apologize again for the tone of my previous post, Nanofrog. Stuff like that is never warranted and I shouldn't have indulged myself. I swear that this long and tedious response will be my last on this topic.
.....
Granted, the articles are written about MPSoC's, the fundamentals still apply (Intel and AMD parts are just more complex).
No. It is precisely because Intel and AMD parts are so different from MPSoC systems that the fundamentals discussed in those first two articles do not apply.
For the narrow, highly specialized design choices in embedded MPSoC systems, which are virtually always dedicated to a single application or a small group of closely related applications, those considerations are valid. But they are not relevant to a Mac Pro user or prospective buyer. Notice that the first article explicitly states:
Such an architectural exploration scheme is quite different from the development of general-purpose computer systems that are designed for good average performance over a set of typical benchmark programs that cover a wide range of applications with different behaviors...in the case of embedded systems, the features of the given application can be used to determine the architectural parameters...Unlike a general-purpose processor, in which a standard cache hierarchy is employed, the memory hierarchy - indeed the overall memory organization - of an MPSoC-based system can be tailored in various ways.
Then there is this:
The memory can be selectively cached; the cache line size can be determined by the application; the designer can opt to discard the cache completely and employ specialized memory configurations such as FIFOs and stream buffers; and so on.
The EE Times articles describe an embedded environment where the designers (i.e. the guy who designs the CPU and memory system and the guy who writes the software sit in adjoining cubicles and argue about cache design and behavior) have pretty much complete control over how hardware and software interact. As the article clearly states, this is very different from a mass-market CPU system that dudes liquid-cool and overclock. For example, the articles you cite mention that in power-efficient embedded systems the memory system must be designed carefully because it might account for 90% of power usage (since the CPUs themselves are a bunch of tiny units that may be, strictly speaking, "general purpose" but tend to have small, array-oriented instruction sets). Observe that here on the Mac Pro Rumors forum, half of the threads involve users asking whether they should buy one or two high-performance GPU cards, each of which dwarfs the 130W thermal footprint of a 6-core, very complex general-purpose CPU. (Actually the GPU cards more closely resemble the systems discussed in your EE Times article than the Mac Pro does.)
The question of optimizing the use of FSBs and memory channels would concern an MPSoC embedded system designer
only because he is designing the memory system as well as the specialized, highly constrained software that uses it. He would not have to expose a programming interface to an operating system or an application in order to optimize them; he would expect the hardware to have been optimized on paper during the design phase. And his compiler would automatically generate code that accounted for that memory architecture without any special flags. These issues don't face Mac application developers.
If compiler does this correctly however, the end result is a high level language compiler that the programmer can view it as abstract (there are still flags that can assist with speed optimizations for the core/s, memory, and disk IO for example).
There are exceptions; but as a general rule, disk I/O is so far from an application in a heavy-duty general purpose operating system like Mac OS (Unix) that a developer could not even begin to think about writing application-level code that would optimize I/O performance with
respect to the architectural specifics of a memory controller.
This is the area where I believe you are conflating two somewhat orthogonal issues. Take, for instance, your example about optimizing cryptography algorithms on Nehalem. One of several reasons I chose the 6-core Mac Pro was precisely because my favorite desktop open-source crypto utility,
TrueCrypt, is optimized (whether by a compiler or by hand-written assembler code I don't know) to use the new AES instructions that are present in Westmere but not in Nehalem. Compilers can certainly generate better array-manipulation code for different levels of the SSE instruction set, and can even optimize for that dynamically at run time.
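For what it's worth, here is roughly what that kind of runtime dispatch looks like. This is my own sketch, not TrueCrypt's actual code, and the function name and messages are made up for illustration: the program asks CPUID whether the AES-NI feature bit (leaf 1, ECX bit 25) is set and picks a code path accordingly. Notice that nothing in it knows or cares what sits between the caches and the DIMMs.

[CODE]
/* Hypothetical sketch of runtime AES-NI dispatch (not TrueCrypt's code). */
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#include <stdio.h>

/* Returns nonzero if the CPU reports the AES-NI feature bit
   (CPUID leaf 1, ECX bit 25). */
static int cpu_has_aesni(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 25) & 1;
}

int main(void)
{
    if (cpu_has_aesni())
        puts("AES-NI present: use the AESENC-based code path");
    else
        puts("No AES-NI: fall back to a table-based software AES");
    return 0;
}
[/CODE]

On a Westmere Mac Pro the first branch is taken; on a Nehalem one, the second. Either way the dispatch depends on an instruction-set feature, not on the memory controller.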
But what do we see in your articles when they talk about optimizing for the Nehalem architecture? Nothing very new, and specifically nothing that mentions FSBs vs. QPI. They say: instrument the code, run the code, find out where the bottlenecks are, optimize the code. Hey, wish I'd thought of that!!! That applies to all software, all the time, and a quick sketch of it follows below. No non-trivial software can ever be optimized to run optimally on a wide array of unspecified general-purpose systems. So yes, all software, including Photoshop and Final Cut Pro and Plants vs. Zombies and World of Warcraft, could be better optimized if the developer were given a much narrower set of assumptions about the underlying hardware.
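To be fair to that advice, here is the whole workflow in miniature, assuming an ordinary Unix-style toolchain; the program and its function name are invented for illustration. Compile with profiling enabled, run, and let the profiler point at the hot loop (on a Mac, Apple's Instruments or the `sample` command plays the same role). Nothing in the procedure is specific to FSBs or QPI.

[CODE]
/* A minimal sketch of "instrument, run, find the hot spot, optimize".
   Build with profiling:  cc -O2 -pg hotspot.c -o hotspot
   Run it:                ./hotspot
   Inspect:               gprof ./hotspot gmon.out
   (On Mac OS X, Instruments or `sample` serves the same purpose.)   */
#include <stdio.h>

/* Deliberately simple hot function; the profiler will point here. */
static double slow_sum(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

int main(void)
{
    static double v[1 << 20];
    for (int i = 0; i < (1 << 20); i++)
        v[i] = i * 0.5;

    double total = 0.0;
    for (int pass = 0; pass < 1000; pass++)   /* enough work to show up */
        total += slow_sum(v, 1 << 20);

    printf("%f\n", total);
    return 0;
}
[/CODE]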
I appreciate your work in digging up all those interesting citations. But please note this key point: in the article you cite where memory system architecture is central, the focus is on dedicated embedded systems where software and hardware can be designed and optimized in tandem. In the articles you cite that discuss the optimization of algorithms on general-purpose systems such as Intel and AMD, no mention is made of memory controller architecture, particularly FSBs and channels. This is not an accident, because the optimization fundamentals in those two kinds of systems are radically different.
In the memory subsystem of a Mac Pro, the hardware puts an address on some pins, and a little while later the main memory puts some data on some other pins. If the memory is under the control of an older memory controller (i.e. an external northbridge chip that adds some latency to the mix), the data comes back a little slower. If the memory is controlled by an integrated multi-channel QPI setup that can achieve more efficient hardware-level concurrency through interleaving, the data comes back a little faster. That's it. The only optimization that can be done is to overclock the memory as a whole, or to exclude sections of main memory from the on-chip caches. And although the compiler has lots of leeway to re-order instructions to reduce the likelihood of stalls, and to align instructions on word boundaries to avoid unnecessary memory cycles, there are
no local optimizations that either a compiler or a human programmer can make that will produce code that runs better when the memory controller is internal with three channels instead of external with two. All your citations are interesting, but none of them, in the final analysis, addresses the point we have been (perhaps needlessly) disputing: many optimizations to software are possible, and software is never fully optimized. But in the Mac Pro, none of the possible local optimizations (even including word alignment) involves deliberate manipulation of the memory controller's behavior.
Best wishes and sorry again for my grumpiness.
TD
Edit: the example given in your Nehalem article, about optimizing the RC4 encryption algorithm by manually (not via the compiler) re-ordering the C code, is instructive, since the assembler output for the resulting changes is provided. The example is embedded in an article with Nehalem in the title, but the optimization itself never mentions Nehalem or memory channels and was in fact done on a Harpertown CPU. It's simply a clever way to delay the use of an intermediate computational result to avoid a stall. It's a good example of how unpredictable the output of compilers can be, and of clever manual loop optimization. It has nothing to do with memory controller architecture.
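Purely to show the flavor of that trick, here is my own sketch of it, not the article's code and considerably simplified: the same RC4 keystream loop written the obvious way, and then hand-reordered so the next iteration's table load is issued before the current keystream byte is read, which delays the use of that loaded value and gives it time to arrive before j needs it. The function names are invented, and the little main() only checks that the two loops produce identical output.

[CODE]
/* Sketch of stall-avoidance by reordering an RC4 keystream loop.
   NOT the code from the article -- just the general idea.           */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Straightforward RC4 PRGA: generate len keystream bytes from state S. */
static void rc4_naive(uint8_t S[256], uint8_t *out, int len)
{
    uint8_t i = 0, j = 0;
    for (int n = 0; n < len; n++) {
        i = (uint8_t)(i + 1);
        uint8_t a = S[i];
        j = (uint8_t)(j + a);
        uint8_t b = S[j];
        S[i] = b;                        /* swap S[i] and S[j] */
        S[j] = a;
        out[n] = S[(uint8_t)(a + b)];    /* load depends on the swap above */
    }
}

/* Same algorithm, with the next iteration's table load issued before the
   current keystream load, so the value that feeds j is requested early
   and its use is pushed later in the instruction stream.               */
static void rc4_reordered(uint8_t S[256], uint8_t *out, int len)
{
    uint8_t i = 1, j = 0;
    uint8_t a = S[1];                    /* pre-load the first S[i] */
    for (int n = 0; n < len; n++) {
        j = (uint8_t)(j + a);
        uint8_t b = S[j];
        S[i] = b;                        /* swap S[i] and S[j] */
        S[j] = a;
        uint8_t k = (uint8_t)(a + b);

        uint8_t i_next = (uint8_t)(i + 1);
        uint8_t a_next = S[i_next];      /* issued early; reads post-swap state */

        out[n] = S[k];                   /* dependent load used last */
        i = i_next;
        a = a_next;
    }
}

int main(void)
{
    /* Identity permutation as a stand-in state (no key schedule); enough
       to check that the two loops generate the same keystream.         */
    uint8_t S1[256], S2[256], k1[64], k2[64];
    for (int n = 0; n < 256; n++)
        S1[n] = S2[n] = (uint8_t)n;
    rc4_naive(S1, k1, sizeof k1);
    rc4_reordered(S2, k2, sizeof k2);
    printf("keystreams %s\n", memcmp(k1, k2, sizeof k1) ? "differ" : "match");
    return 0;
}
[/CODE]

And that's exactly the point: the transformation is about hiding load-to-use latency inside the core, so it helps on a Harpertown with an FSB just as it would on a Westmere with three integrated channels. Nothing in it touches, or could touch, the memory controller.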