
trankdart

macrumors member
Original poster
Sorry if this is a noob or oft-answered question. I have searched (a little😱).

I know the newer Intel QPI processors have three main-memory channels, and I've seen much discussion and several complaints on these threads that Apple didn't expand their memory risers to squeeze in six DRAM sockets instead of four, so that the module count on each riser could be a multiple of 3. (And also that they should have done this for the 2009 Nehalems).

Here's my question: what kind of performance would I be losing and why if I got one of these new Westmere machines, discarded Apple's stock memory and replaced it with 4--rather than 3--OWC 4GB sticks? What I specifically don't understand is, can't the QPI fetch from any 3 of the 4 modules at once? Why is there any performance penalty at all assuming the 4 modules are identical?

Thanks for any help. A link to a white paper or anandtech-type page with the technical details would be greatly appreciated.

TD
 
It's not bad at all. I have 4x4GB sticks from OWC in mine, and it has been great. I've even used all 16 gigs during a project.
 
Still, it's common to hear talk about 3 sticks of memory being more ideal for triple channel memory.


So I'm curious as well as to what effect it has.
 
Still, it's common to hear talk about 3 sticks of memory being more ideal for triple channel memory.


So I'm curious as well as to what effect it has.

As I understand it, it runs in two channels instead of three, so long as the RAM is all matched. Tests have indicated a negligible difference in speed, according to those that have done the tests. I can dig up the thread...
Edit: I'm having trouble getting the link on my iPhone, but search this forum for a thread from last May called "12 Gb or 16 Gb RAM and my programs," or something like that. They explain it well. Basically, most software isn't written to take advantage of triple channel.
 
The memory controller can't just interact with any 3 DIMMs in the system on the fly. That is why where the DIMMs are installed sets how it operates.

I don't know if this link makes it any clearer for you. Look up how memory channels work or interleaving data if you want more information or to get a better grasp of things.
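For anyone who wants a feel for what "interleaving" means, here is a minimal sketch. It is a toy model, not how the Nehalem controller actually maps addresses: it simply assumes consecutive 64-byte cache lines rotate across the available channels. The function names and the mapping rule are assumptions made for illustration.

```python
from collections import Counter

CACHE_LINE_BYTES = 64  # typical x86 cache line size


def channel_for_address(addr: int, num_channels: int) -> int:
    """Toy mapping (an assumption): consecutive cache lines rotate across channels."""
    return (addr // CACHE_LINE_BYTES) % num_channels


def channel_histogram(start: int, length: int, num_channels: int) -> Counter:
    """Count how many cache lines of a contiguous buffer land on each channel."""
    first_line = start // CACHE_LINE_BYTES
    last_line = (start + length - 1) // CACHE_LINE_BYTES
    return Counter(line % num_channels for line in range(first_line, last_line + 1))


if __name__ == "__main__":
    # A 64 KiB buffer spread over three channels in this toy model.
    print(channel_histogram(0, 64 * 1024, 3))
```

The point of the sketch is that the mapping is fixed by the hardware configuration, so a sequential read is spread across channels without the software doing anything.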
 
Tests have indicated a negligible difference in speed, according to those that have done the tests. I can dig up the thread...

I'm one of the testers. I did 6GB (3X2GB) and 8GB (4X2GB). In every real-world test, the 8GB configuration came out faster in PS4.

https://forums.macrumors.com/threads/736539/

I'd be curious to see those same tests done in 10.6, with PS5, but I'm pretty sure that they'd be the same. Very very few apps can saturate the RAM bandwidth, and unless your app can do that, you won't see any benefit of using less RAM in triple channel.

Loa
 
Thanks to those who replied, especially Umbongo. The diagram at the link in his post seems to clear it up. It doesn't seem like there should be any penalty for using all four memory slots.

The drawback to having four slots appears to be just a restriction to less RAM than the computer could otherwise use with the same 3-channel bandwidth if six slots/DIMMS were present, e.g. it would go up to 48GB instead of 32GB with 8GB sticks.
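The capacity arithmetic can be checked with a tiny sketch. The helper names are made up for illustration, and the round-robin slot distribution is only meant to show the per-channel counts, not which physical slot sits on which channel.

```python
def max_capacity_gb(slots: int, dimm_gb: int) -> int:
    """Maximum RAM if every slot holds a DIMM of the given size."""
    return slots * dimm_gb


def dimms_per_channel(slots: int, channels: int = 3) -> list:
    """Spread slots over channels round-robin (illustrative counts only)."""
    counts = [0] * channels
    for slot in range(slots):
        counts[slot % channels] += 1
    return counts


if __name__ == "__main__":
    for slots in (4, 6):
        print(slots, "slots:", max_capacity_gb(slots, 8), "GB,",
              "DIMMs per channel:", dimms_per_channel(slots))
```

With four slots, one channel ends up carrying two DIMMs while the others carry one. With six, every channel carries two.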

In fact OWC's test charts show that performance on huge Photoshop test jobs improves going from 12GB to 16GB on a Quad 2009 MP, confirming Loa's answer above.

I did find the "12gb vs. 16gb" thread referred to by wonderspark. That thread was interesting but unfortunate, since one poster was promoting the rather terrifying idea that application-level software has to somehow be modified to take advantage of the Nehalem 3-channel memory architecture in the same way it has to be optimized for threading. This is 1000% false. Individual memory channel operation is transparent to all software including the operating system and the EFI, let alone applications. Overclocking the entire memory subsystem doesn't count...channels can't be overclocked individually.


TD
 
That thread was interesting but unfortunate, since one poster was promoting the rather terrifying idea that application-level software has to somehow be modified to take advantage of the Nehalem 3-channel memory architecture in the same way it has to be optimized for threading. This is 1000% false. Individual memory channel operation is transparent to all software including the operating system and the EFI, let alone applications. Overclocking the entire memory subsystem doesn't count...channels can't be overclocked individually.
You probably saw my posts. Take a look here, and hopefully you may see what I was getting at (namely that the compiler used isn't actually designed for the Nehalem architecture). Pay particular attention to the following sections:
  • Factors affecting optimization
  • Intended use of the generated code
For example, let's say the software's compiled on a tool designed for some form of FSB. As that particular architecture is different from the DDR3 controller, the Pipeline structures aren't optimized for the DDR3 controller on the newer CPU's. So I/O is handled less efficiently than it could be, and has a penalty in throughput. It will still generate a result (work), just not as quickly as the actual hardware being used is capable of generating it (assuming the actual application can benefit from the memory architecture in the first place, as many can't).

Due to the need for code to run on multiple CPU's (backwards compatibility), this is usually considered an acceptable compromise, as the application can be purchased and used by more buyers.

But in cases where flat out performance is necessary on the newer architecture (such as Nehalem or Westmere based servers), either the compiler has to be re-written to accommodate the newer architecture, or IO pipeline bottlenecks in the compiler addressed in the application itself (less than ideal).

Hopefully you understand the point I was trying to make. 🙂
 
You probably saw my posts. Take a look here, and hopefully you may see what I was getting at (namely that the compiler used isn't actually designed for the Nehalem architecture). Pay particular attention to the following sections:
  • Factors affecting optimization
  • Intended use of the generated code
For example, let's say the software's compiled on a tool designed for some form of FSB. As that particular architecture is different from the DDR3 controller, the Pipeline structures aren't optimized for the DDR3 controller on the newer CPU's. So I/O is handled less efficiently than it could be, and has a penalty in throughput. It will still generate a result (work), just not as quickly as the actual hardware being used is capable of generating it (assuming the actual application can benefit from the memory architecture in the first place, as many can't).

Due to the need for code to run on multiple CPU's (backwards compatibility), this is usually considered an acceptable compromise, as the application can be purchased and used by more buyers.

But in cases where flat out performance is necessary on the newer architecture (such as Nehalem or Westmere based servers), either the compiler has to be re-written to accommodate the newer architecture, or IO pipeline bottlenecks in the compiler addressed in the application itself (less than ideal).

Hopefully you understand the point I was trying to make. 🙂

Yes, I was referring to your posts, but I'm afraid I don't understand the point you were trying to make.

I might understand it better if you could give me an example of a compiler that generates different code when it knows the program it is compiling will run on a system with an FSB. I understand how a compiler might arrange machine instructions in a different order if it knew that the program might run on a hyperthreaded core. But I don't understand what it would do differently for a triple-channel memory subsystem as opposed to an FSB.

In fact, I believe that it would do nothing differently because there is nothing that can be done, and that therefore no compiler exists that takes into account the presence of an FSB.

If you can point me to one, or at least tell me how it operates, I'll be glad to learn something new.

TD
 
I might understand it better if you could give me an example of a compiler that generates different code when it knows the program it is compiling will run on a system with an FSB. I understand how a compiler might arrange machine instructions in a different order if it knew that the program might run on a hyperthreaded core. But I don't understand what it would do differently for a triple-channel memory subsystem as opposed to an FSB.
You set it via flags, and the age and capability of the compiler matter (i.e. older versions won't have the optimizations for newer CPU architectures). Take the -xT flag (Intel C++), for example, which was meant for Intel Core2 Duo processors, Intel Core2 Quad processors and Intel Xeon processors with SSSE3 (i.e. 2006). Specifically, this flag is meant to assist with memory access.

Here's an article comparing which (Sun Free Studio v. GCC) is faster for Nehalem. Perhaps this might shed a little light for you, as it contrasts why GCC lost out at the time the article was written (simple article, not deep at all).
 
I might understand it better if you could give me an example of a compiler that generates different code when it knows the program it is compiling will run on a system with an FSB.

It isn't just different code, it would have to be different code that makes a difference. With processors like the current Intel x86 there is lots of dynamic instruction reordering done in the decode/dispatch process.

You might change

load mem1
do some instr.
load mem2
do some instr.


versus
load mem1
load mem2
do some instr.
do some instr.

but they basically execute in the same time. So the triple-channel memory will add non-stalling pipeline fetches in several places (so you don't have to hoist instructions to what you think is a pipeline stall), but the dynamic scheduling can take care of that also. In other words, the optimizer doesn't have to have perfect knowledge of what the dynamic pipeline delays are.

If these were strictly in-order processors and all the optimization/reordering had to be done statically, then the exact predictions for the pipeline stall percentages and cache/memory delay stalls would perhaps be a bigger deal. With dynamic adjustments, you can be slightly off and still run at full speed. Hence the reason why Nehalem and Westmere processors generally do better than previous single-controller archs even at slightly lower core clock speeds.



I understand how a compiler might arrange machine instructions in a different order if it knew that the program might run on a hyperthreaded core.

What are those? If you're referring to how a compiler might string out loads/stores a bit because the resources are partitioned (i.e., each thread gets a smaller subset), then there are some similarities. If it knows that addresses can be kicked off in a pipelined fashion (the number of outstanding requests can increase with no ill effects), it can reshuffle the instruction stream, whereas in the hyperthreading context it may want to stretch them out slightly.
 
I'd be curious to see those same tests done in 10.6, with PS5, but I'm pretty sure that they'd be the same.

Not sure why they wouldn't be, because that test, like several others here, measures a non-effect: it introduces significantly more disk accesses. Since disk accesses are an order of magnitude slower, they totally swamp any differences in memory performance. In short, those tests don't measure differences in memory setups.

Unless 10.6 degrades the effectiveness of the disk buffer cache (memory not used by the app is used by the OS to cache more disk content), the results will probably be the same.



Very very few apps can saturate the RAM bandwidth,

Hyperthreading wouldn't work if that were true. The fact is that 10-30 instructions with a modestly high rate of load/store instructions can saturate it. The lengthy delay waiting on memory is what allows another hyperthread to use resources that are local to the CPU to start up and get work done while waiting.


and unless your app can do that, you won't see any benefit of using less RAM in triple channel.

And yet in several benchmarks done here the 2.26 Nehalem does better than the 2.8 Harpertown:

http://www.macworld.com/article/139507/2009/03/macpro2009.html

Having more than one has an effect. Whether you can milk every last drop out is much more a matter of whether your app can avoid tapping the disk too often than of whether you get 2 or 3 channels going with a set of pipelined requests.
 
You probably saw my posts. Take a look here, and hopefully you may see what I was getting at (namely that the compiler used isn't actually designed for the Nehalem architecture). Pay particular attention to the following sections:
  • Factors affecting optimization
  • Intended use of the generated code
For example, let's say the software's compiled on a tool designed for some form of FSB. As that particular architecture is different from the DDR3 controller, the Pipeline structures aren't optimized for the DDR3 controller on the newer CPU's. So I/O is handled less efficiently than it could be, and has a penalty in throughput. It will still generate a result (work), just not as quickly as the actual hardware being used is capable of generating it (assuming the actual application can benefit from the memory architecture in the first place, as many can't).

Due to the need for code to run on multiple CPU's (backwards compatibility), this is usually considered an acceptable compromise, as the application can be purchased and used by more buyers.

But in cases where flat out performance is necessary on the newer architecture (such as Nehalem or Westmere based servers), either the compiler has to be re-written to accommodate the newer architecture, or IO pipeline bottlenecks in the compiler addressed in the application itself (less than ideal).

Hopefully you understand the point I was trying to make. 🙂

Deja vu... We had this debate before... 🙂 there are no code optimizations needed to take advantage of different physical memory configurations, just as there aren't any code optimizations required for the OS to take advantage of different disk storage systems (eg. RAID). Memory and disk storage systems are abstracted up to the OS and Applications.
 
Yes, sorry for the noise Virtual Rain. I saw your earlier thread and understand that you've attempted to explain it to them already. It's enough of this, it's hopeless. But at least I got my question answered: it won't hurt to use all four DIMM slots. That part was helpful, at least to me. 🙂

TD
 
One curiosity of the Nehalem architecture is that getting full RAM speed (for those chips capable of 1333MHz RAM) required only one RAM module per channel, so adding a fourth would slow the whole system.

BUT this wouldn't affect the Mac Pro, as its RAM is restricted to the slower speed anyway. Westmere probably doesn't have this restriction either. Even with Nehalem, manufacturers such as Sun were producing systems that would run at 1333MHz with more than one slot filled per channel.

Even with the RAM speed being slower, the latency is not much different so the overall speed change is probably negligible.
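To put rough numbers on the speed difference, here is a sketch of the theoretical peak bandwidth calculation. It assumes the "slower speed" is 1066 MT/s (the thread does not state the figure) and uses the standard 64-bit (8-byte) DDR3 channel width. These are ceilings, not what an application actually sees.

```python
BUS_WIDTH_BYTES = 8  # each DDR3 channel is 64 bits wide


def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int) -> float:
    """Theoretical peak in GB/s: transfers/s * bytes per transfer * channels."""
    return mega_transfers_per_s * 1e6 * BUS_WIDTH_BYTES * channels / 1e9


if __name__ == "__main__":
    # 1066 is an assumed figure for the slower speed; 1333 is from the post.
    for rate in (1066, 1333):
        for channels in (2, 3):
            print(f"DDR3-{rate}, {channels} channels: "
                  f"{peak_bandwidth_gb_s(rate, channels):.1f} GB/s")
```

Only the peak figure scales with the transfer rate and the channel count. Latency is a separate quantity and is not captured by this arithmetic.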
 
And yet in several benchmarks done here the 2.26 Nehalem does better than the 2.8 Harpertown

Well you're comparing a lot more than dual and triple channels here!

Show me a test that shows PS5 working significantly faster using triple channel on a 2009 MP, and I'll do it myself! (something like 6GB Vs 8GB, or 12GB vs 16GB)

Loa
 
Very very few apps can saturate the RAM bandwidth

Hyperthreading wouldn't work if that were true.

Don't confuse latency with bandwidth.

The reason hyperthreading is usually a net performance improvement is that it works around high latency, i.e. it gives the CPU something else to do after a cache miss. If you are instead out of bandwidth, there is no performance gain from running more concurrent processes (i.e. no point to hyperthreading).

This is not just simple semantics. This is a very very important distinction. It's (relatively) trivial to increase bandwidth, just add more parallel memory channels like Intel did. It's very hard to reduce latency. That's why recent Intel chips have 3 levels of cache and also have sophisticated cache management algorithms.

Finally, I agree with the GP poster. Most apps don't come close to saturating the RAM bandwidth. When Intel moved the memory controller off the north bridge, it really improved the bandwidth (and also btw really improved latency).
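The latency/bandwidth distinction can be made concrete with Little's law (data in flight = bandwidth x latency). The sketch below is a back-of-the-envelope calculation only: the 25.6 GB/s and 60 ns figures are hypothetical inputs chosen for illustration, not measurements from any Mac Pro, and the function names are made up.

```python
CACHE_LINE_BYTES = 64


def lines_in_flight_needed(bandwidth_gb_s: float, latency_ns: float) -> float:
    """Cache-line requests that must be outstanding to keep the bus full.

    GB/s multiplied by ns gives bytes, so no unit conversion is needed.
    """
    return bandwidth_gb_s * latency_ns / CACHE_LINE_BYTES


def achievable_bandwidth_gb_s(outstanding_lines: int, latency_ns: float) -> float:
    """Bandwidth reached when only this many misses are in flight at once."""
    return outstanding_lines * CACHE_LINE_BYTES / latency_ns


if __name__ == "__main__":
    # Hypothetical inputs: 25.6 GB/s peak bandwidth, 60 ns memory latency.
    print(lines_in_flight_needed(25.6, 60.0))
    print(achievable_bandwidth_gb_s(1, 60.0))
```

Under these assumed numbers, code that waits on one miss at a time reaches only a small fraction of the peak, which is why such code is limited by latency well before it runs out of bandwidth.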
 