That's part of interleaving though. And in most cases, more RAM is usually far more advantageous than raw throughput, as precious little software is capable of utilizing it fully (i.e. 1x DIMM per channel).
First, no application software change is required. The chip should do the parallelism transparently. If you put a crossbar switch between the 4 cores and the 3 memory controllers, then 3 different cores can pull data from three different controllers in parallel with no change in software. If you don't have too much data sitting behind a single controller, the work will naturally spread out. In contrast, if you put a large amount of memory behind one controller, then it is more likely that 2 cores will both be pulling data from that pool of memory. At the extreme, if all four cores are pulling data from a single memory pool, you have the "front side bus" configuration that this Nehalem architecture abandoned. Intel has a pretty bonehead design if that parallelism doesn't get leveraged on a common basis. Likewise, the OS is pretty bubblegum if it tends to herd multiple applications into single memory pools.
There is an additional incremental speed hit if you don't interleave, but not leveraging the multiple memory controllers at all is a waste of money on the new tech. Might as well have stuck with a Core2-arch box.
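To make the "work naturally spreads out" point concrete, here is a toy sketch (Python, purely illustrative; the buffer-to-controller mapping and helper names are invented for this example, not the real Nehalem address mapping). Four cores streaming from buffers that land behind different controllers keep three controllers busy at once; pile everything behind one controller and the streams queue up the way they would on a shared front-side bus.

```python
# Toy model of concurrent core streams vs. memory controllers.
# Purely illustrative -- NOT the real Nehalem address mapping.

NUM_CONTROLLERS = 3

def controller_for(buffer_id, spread_across_channels):
    """Pick the controller that owns a buffer.

    spread_across_channels=True  -> buffers land round-robin across controllers
    spread_across_channels=False -> everything was allocated behind controller 0
    """
    return (buffer_id % NUM_CONTROLLERS) if spread_across_channels else 0

def busy_controllers(core_buffers, spread_across_channels):
    """How many controllers can service the cores' streams in parallel."""
    return len({controller_for(b, spread_across_channels) for b in core_buffers})

# Four cores, each streaming from its own buffer.
cores = [0, 1, 2, 3]
print("spread out:", busy_controllers(cores, True))   # 3 controllers working in parallel
print("one pool  :", busy_controllers(cores, False))  # 1 controller, everyone waits on it
```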
Secondly, for many folks the desire for lots of memory slots is driven as much by the current nonlinear pricing on 4GB DIMMs. If 4GB DIMMs were 2.0 or even 2.2 times as expensive as 2GB DIMMs, most folks would use four 4GB DIMMs to get to 16GB rather than buying eight 2GB DIMMs. Currently, they are 3x as expensive. The double down on slots is to get cheaper DIMMs, not necessarily more memory. Those are not equivalent in many cases (sure, there are the folks at the fully-populated-DIMM-slot extreme, but...).
When 4GB DIMMs eventually drop into the 2.2-multiple range, the Mac Pro design will make more sense to a larger set of folks. Even so, the single-processor-package setup can go to 8GB and the dual-package setup to 16GB without dipping into 4GB DIMMs. For the folks whose working set is smaller than that, it works and it is faster (if the work doesn't collide more than average).
The double down only has a big impact for those who have a near-term need for >8GB (to avoid swapping) but less than 4 cores of parallelism. There are some folks who need 32GB dual-package setups, but not a huge number.
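To put rough numbers on that break-even point, here is a sketch of the 16GB build-out at the different price multiples (the $50 base price for a 2GB DIMM is just a placeholder, not a market quote):

```python
# Cost of a 16GB build-out at different 4GB-vs-2GB price multiples.
# The $50 base price for a 2GB DIMM is a placeholder, not a real quote.

price_2gb = 50.0

for multiple in (2.0, 2.2, 3.0):
    price_4gb = multiple * price_2gb
    eight_x_2gb = 8 * price_2gb   # fill all eight slots with cheap DIMMs
    four_x_4gb = 4 * price_4gb    # half the slots, denser DIMMs
    premium = four_x_4gb / eight_x_2gb - 1.0
    print(f"{multiple:.1f}x: 8x2GB = ${eight_x_2gb:.0f}, "
          f"4x4GB = ${four_x_4gb:.0f} ({premium:+.0%} premium for the denser DIMMs)")
```

At 2.0x the two routes cost the same; at 2.2x the denser DIMMs carry only a 10% premium (and leave slots free); at today's 3x they carry a 50% premium, which is why the cheap-DIMM double down wins.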
configuration would have had to be skipped (processor + RAM anyway, they still could have attempted a daughterboard for RAM only).
A daughterboard just for RAM is misguided. What you want are taller heatsinks on the processor packages, since those are going to be more effective. One thing the daughtercard does is orient the heatsinks along the longest dimension of the box. There is no significant height constraint on RAM that needs to be avoided; it is more a surface-area constraint... which, again, the daughtercard addresses by using space more efficiently (consuming "width" rather than "height") and by keeping trace lengths down.
Not really, as a cluster is where the trends are going for "heavy lifting", as I'm sure you're aware.
That is not a new trend. That is where the computer market has been since the start. Big machines did big jobs. The only "new" factor is gluing a "big machine" together out of smaller, decoupled components.
You are skipping past the point I was making. Workstations are not going to obviate all clusters. Most clusters are expensive; that means you have to share them with someone else to make them cost effective. Workstations give you the freedom to run whatever crazy computational load you want, and nobody else cares because they are using something else. For folks with heavy jobs that fit on a workstation, it often pays to get the workstation (given you would have to "pay" to get time on the cluster). You can also prune off smaller test-run jobs just to see what the computation does without burning tons of cluster time. Once you have the tweaks right, you can run the larger job on the farm.
I get your point in the context where you might have a number of workstations on a LAN and only infrequently/non-concurrently does a single person need to run a job on the cluster (perhaps overnight when everyone goes home: John Doe gets Monday and Wednesday nights, Fred Flintstone gets Tuesdays and Thursdays, Barney gets Fridays and Saturdays, etc.).
Even blade servers come with DP setups as often as SP ones. As long as there are ever-increasing ways to consume computational cycles, the pressure will stay on cluster nodes to pack more cores into a single unit.
You still have data-locality issues; if you can avoid them by moving the cores closer to the data, you get results faster. There is also little benefit in having a higher number of power supplies to feed.
If all the heavy lifting is done in the cluster, you could just get a Mac mini, since the main app you run to accomplish that is an xterm or VNC session to the real machine.
and the newer parts will make it more cost effective (fewer systems necessary to build the cluster for a given performance level with those currently available).
And when the number of systems is one, you don't need anything more than a workstation, if workstations and servers use approximately the same computation engine.
The other problem is software costs. If the software is licensed per system and the license is a significant percentage of the workstation/server price, it costs more to put it into a cluster. For example (assuming the workload is embarrassingly parallel):
4 Mac minis ==> 8 cores (approx. $2,400 hardware cost)
1 Mac Pro Octo ==> 8 cores (approx. $3,200 hardware cost)
However, if you need to buy an $800 software license per system, then the total costs are:
Mini cluster system ==> $5,600
Mac Pro Octo system ==> $4,000
If the software is licensed per core, then the mini cluster could be cheaper.
This is almost the mirror image of the folks avoiding the nonlinear DIMM pricing; only here it is per-system (versus per-core) software licensing.
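Here is that arithmetic written out, using the approximate figures above (the $600-per-mini price is just the $2,400 total divided by four, and the $800 license is the per-system example number):

```python
# Total cost = hardware + (per-system license x number of systems),
# using the approximate prices from the example above.

license_per_system = 800

configs = {
    "4x Mac mini cluster (8 cores)": {"systems": 4, "hardware": 4 * 600},
    "1x Mac Pro Octo     (8 cores)": {"systems": 1, "hardware": 3200},
}

for name, cfg in configs.items():
    licenses = cfg["systems"] * license_per_system
    print(f"{name}: hardware ${cfg['hardware']}, licenses ${licenses}, "
          f"total ${cfg['hardware'] + licenses}")
```

Flip the license to per-core pricing and both configurations pay for 8 cores, so only the hardware difference is left and the mini cluster comes out ahead.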