Why is 4 DRAM sticks in a 3-channel system bad?

Discussion in 'Mac Pro' started by trankdart, Jul 30, 2010.

  1. trankdart macrumors member

    Joined:
    Jul 28, 2010
    Location:
    Los Angeles, CA, USA
    #1
    Sorry if this is a noob or oft-answered question. I have searched (a little:eek:).

    I know the newer Intel QPI processors have three main-memory channels, and I've seen much discussion and several complaints on these threads that Apple didn't expand their memory risers to squeeze in six DRAM sockets instead of four, so that the module count on each riser could be a multiple of 3. (And also that they should have done this for the 2009 Nehalems).

    Here's my question: what kind of performance would I lose, and why, if I got one of these new Westmere machines, discarded Apple's stock memory, and replaced it with four (rather than three) OWC 4GB sticks? What I specifically don't understand is: can't the QPI fetch from any 3 of the 4 modules at once? Why is there any performance penalty at all, assuming the 4 modules are identical?

    Thanks for any help. A link to a white paper or anandtech-type page with the technical details would be greatly appreciated.

    TD
     
  2. wonderspark macrumors 68040

    Joined:
    Feb 4, 2010
    Location:
    Oregon
    #2
    It's not bad at all. I have 4x4GB sticks from OWC in mine, and it has been great. I've even used all 16 gigs during a project.
     
  3. Ravich macrumors 6502a

    Joined:
    Oct 20, 2009
    Location:
    Portland, OR
    #3
    Still, it's common to hear that 3 sticks of memory are more ideal for triple-channel memory.

    So I'm curious as well about what effect it has.
     
  4. wonderspark macrumors 68040

    Joined:
    Feb 4, 2010
    Location:
    Oregon
    #4
    As I understand it, it runs in two channels instead of three, as long as the RAM is all matched. Tests have indicated a negligible difference in speed, according to those who have run them. I can dig up the thread...
    Edit: I'm having trouble getting the link on my iPhone, but search this forum for a thread from last May called "12 Gb or 16 Gb RAM and my programs," or something like that. They explain it well. Basically, most software isn't written to take advantage of triple channel.
     
  5. Umbongo macrumors 601

    Joined:
    Sep 14, 2006
    Location:
    England
    #5
    The memory controller can't just interact with any 3 DIMMs in the system on the fly. That's why where the DIMMs are installed determines how it operates.

    I don't know if this link makes it any clearer for you. Look up how memory channels work, or data interleaving, if you want more information or a better grasp of things.
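
    If it helps, here's a very rough sketch of the idea behind interleaving (a minimal illustration in C; the mapping and numbers below are made up, the real one is fixed by the memory controller and by how the slots are populated):

        /* Simplified sketch of cache-line interleaving: the controller deals
           consecutive cache lines out across whichever channels are populated.
           The mapping here is hypothetical and purely illustrative. */
        #include <stdio.h>
        #include <stdint.h>

        #define CACHE_LINE 64   /* bytes per cache line on these CPUs */

        /* Hypothetical mapping: consecutive cache lines rotate across channels. */
        static int channel_for(uint64_t addr, int populated_channels)
        {
            return (int)((addr / CACHE_LINE) % populated_channels);
        }

        int main(void)
        {
            /* With 3 channels populated, lines rotate 0,1,2,0,1,2...; with an
               unbalanced population the rotation (and peak throughput) changes,
               which is why slot placement matters to the controller. */
            for (uint64_t addr = 0; addr < 6 * CACHE_LINE; addr += CACHE_LINE)
                printf("cache line at 0x%03llx -> channel %d\n",
                       (unsigned long long)addr, channel_for(addr, 3));
            return 0;
        }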
     
  6. Loa macrumors 65816

    Joined:
    May 5, 2003
    Location:
    Québec
    #6
    I'm one of the testers. I did 6GB (3x2GB) and 8GB (4x2GB). In every real-world test, the 8GB came out faster in PS4.

    http://forums.macrumors.com/showthread.php?t=736539

    I'd be curious to see those same tests done in 10.6 with PS5, but I'm pretty sure they'd turn out the same. Very, very few apps can saturate the RAM bandwidth, and unless your app can do that, you won't see any benefit from using less RAM in triple channel.
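
    For a sense of what "saturating the RAM bandwidth" means, here's a toy sketch (plain C, nothing to do with Photoshop, and the sizes are arbitrary): the first loop does lots of arithmetic per byte touched and is limited by the CPU, so channel count is irrelevant to it; only something like the second loop, which just streams through memory, can actually be limited by dual vs. triple channel.

        /* Toy contrast between a compute-bound loop and a memory-bound loop.
           Illustration only, not a benchmark of anything discussed in this thread. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N (64 * 1024 * 1024)   /* 64M doubles = 512 MB, well past the caches */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            if (!a) return 1;
            for (size_t i = 0; i < N; i++) a[i] = 1.0;

            /* Compute-bound: many dependent floating-point ops per byte loaded.
               Memory channels make essentially no difference here. */
            double x = 0.0;
            for (size_t i = 0; i < N; i++) {
                double v = a[i];
                for (int k = 0; k < 50; k++)
                    v = v * 1.0000001;    /* 50 multiplies per element loaded */
                x += v;
            }

            /* Memory-bound: one add per element; the loop just streams the array.
               This is the kind of access pattern that can saturate RAM bandwidth. */
            double y = 0.0;
            for (size_t i = 0; i < N; i++)
                y += a[i];

            printf("%f %f\n", x, y);   /* keep the compiler from discarding the work */
            free(a);
            return 0;
        }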

    Loa
     
  7. trankdart thread starter macrumors member

    Joined:
    Jul 28, 2010
    Location:
    Los Angeles, CA, USA
    #7
    Thanks to those who replied, especially Umbongo. The diagram at the link in his post seems to clear it up. It doesn't seem like there should be any penalty for using all four memory slots.

    The drawback of having only four slots appears to be just that the machine tops out at less RAM than it could otherwise use with the same 3-channel bandwidth if six slots/DIMMs were present, e.g. it would go up to 48GB instead of 32GB with 8GB sticks.

    In fact OWC's test charts show that performance on huge Photoshop test jobs improves going from 12GB to 16GB on a Quad 2009 MP, confirming Loa's answer above.

    I did find the "12gb vs. 16gb" thread referred to by wonderspark. That thread was interesting but unfortunate, since one poster was promoting the rather terrifying idea that application-level software has to somehow be modified to take advantage of the Nehalem 3-channel memory architecture, in the same way it has to be optimized for threading. This is 1000% false. Individual memory channel operation is transparent to all software, including the operating system and the EFI, let alone applications. Overclocking the entire memory subsystem doesn't count... channels can't be overclocked individually.


    TD
     
  8. nanofrog macrumors G4

    Joined:
    May 6, 2008
    #8
    You probably saw my posts. Take a look here, and hopefully you'll see what I was getting at (namely that the compiler used isn't actually designed for the Nehalem architecture). Pay particular attention to the following sections:
    • Factors affecting optimization
    • Intended use of the generated code
    For example, let's say the software is compiled with a tool designed for some form of FSB. As that architecture is different from the integrated DDR3 controller, the pipeline structures aren't optimized for the DDR3 controller on the newer CPUs. So I/O is handled less efficiently than it could be, and there's a penalty in throughput. It will still generate a result (work), just not as quickly as the actual hardware is capable of generating it (assuming the application can benefit from the memory architecture in the first place, as many can't).

    Due to the need for code to run on multiple CPUs (backwards compatibility), this is usually considered an acceptable compromise, as the application can be purchased and used by more buyers.

    But in cases where flat-out performance is necessary on the newer architecture (such as Nehalem or Westmere based servers), either the compiler has to be re-written to accommodate the newer architecture, or the I/O pipeline bottlenecks have to be addressed in the application itself (less than ideal).

    Hopefully you understand the point I was trying to make. :)
     
  9. trankdart thread starter macrumors member

    Joined:
    Jul 28, 2010
    Location:
    Los Angeles, CA, USA
    #9
    Yes, I was referring to your posts, but I'm afraid I don't understand the point you were trying to make.

    I might understand it better if you could give me an example of a compiler that generates different code when it knows the program it is compiling will run on a system with an FSB. I understand how a compiler might arrange machine instructions in a different order if it knew that the program might run on a hyperthreaded core. But I don't understand what it would do differently for a triple-channel memory subsystem as opposed to an FSB.

    In fact, I believe that it would do nothing differently because there is nothing that can be done, and that therefore no compiler exists that takes into account the presence of an FSB.

    If you can point me to one, or at least tell me how it operates, I'll be glad to learn something new.

    TD
     
  10. nanofrog macrumors G4

    Joined:
    May 6, 2008
    #10
    You set it via flags, and the age and capability of the compiler matter (i.e. older versions won't have the optimizations for newer CPU architectures). Take the -xT flag (Intel C++), for example, which was meant for Intel Core 2 Duo, Core 2 Quad, and Xeon processors with SSSE3 (i.e. 2006 vintage). Specifically, that flag is meant to assist with memory access.
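
    To make that a bit more concrete, here's a hedged sketch (the flag names below are examples, and their availability depends on the compiler and its version): the same source file can be built with or without an architecture target, and only the targeted build lets the compiler select and schedule instructions with the newer core in mind.

        /* sketch.c -- one source file, built with different architecture targets.
           Example invocations (flag availability varies with compiler/version):

               gcc -O2 -mtune=generic sketch.c -o generic_build   (no particular CPU)
               gcc -O2 -march=core2   sketch.c -o core2_build     (Core 2 / SSSE3 era)
               icc -O2 -xT            sketch.c -o xt_build        (the Intel C++ flag above)

           The target changes instruction selection and scheduling in the generated
           machine code; whether that helps on the memory side is exactly what is
           being debated in this thread. */
        #include <stdio.h>

        int main(void)
        {
            double sum = 0.0;
            for (int i = 0; i < 1000000; i++)
                sum += (double)i * 0.5;   /* a loop a vectorizer may treat differently per target */
            printf("%f\n", sum);
            return 0;
        }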

    Here's an article comparing which of the two (Sun Free Studio vs. GCC) is faster for Nehalem. Perhaps it will shed a little light for you, as it explains why GCC lost out at the time the article was written (a simple article, not deep at all).
     
  11. deconstruct60 macrumors 604

    Joined:
    Mar 10, 2009
    #11
    It isn't just different code, it would have to be different code that makes a difference. With processors like the current Intel x86 there is lots of dynamic instruction reordering done in the decode/dispatch process.

    You might change

        load mem1
        do some instr.
        load mem2
        do some instr.

    into

        load mem1
        load mem2
        do some instr.
        do some instr.

    but they basically execute in the same time. So triple-channel memory will add non-stalling pipeline fetches in several places (so you don't have to hoist instructions above what you think is a pipeline stall), but the dynamic scheduling can take care of that too. In other words, the optimizer doesn't have to have perfect knowledge of what the dynamic pipeline delays are.

    If these were strictly in-order processors and all the optimization/reordering had to be done statically, then exact predictions of the pipeline stall percentages and cache/memory delays would perhaps be a bigger deal. With dynamic adjustments, you can be slightly off and still run at full speed. Hence why Nehalem and Westmere processors generally do better than the previous single-controller architectures, even at slightly lower core clock speeds.



    What are those? If you're referring to how a compiler might string out loads/stores a bit for a hyperthreaded core because those resources are partitioned (i.e., each thread gets a smaller subset), then there are some similarities. If the addresses are known, requests can be kicked off in a pipelined fashion (you can increase the number of outstanding requests with no ill effects) and the instruction stream can be reshuffled, whereas in the hyperthreaded context you may want to stretch them out slightly.
     
  12. deconstruct60 macrumors 604

    Joined:
    Mar 10, 2009
    #12
    Not sure why not, because that test, like several others here, mostly measures the non-effect you get once you introduce significantly more disk accesses. Since disk accesses are orders of magnitude slower, they totally swamp any differences in memory performance. In short, those tests don't measure differences between memory setups.

    Unless 10.6 somehow cripples the effectiveness of the disk buffer cache (memory not used by the app gets used by the OS to cache more disk content), the results will probably be the same.

    Hyperthreading wouldn't work if that were true. The fact is that 10-30 instructions with a modestly high rate of loads/stores can saturate the memory subsystem. The lengthy delay waiting on memory is what allows another hyperthread, whose resources are local to the CPU, to start up and get work done while the first thread waits.


    And yet in several benchmarks done here, the 2.26 Nehalem does better than the 2.8 Harpertown:

    http://www.macworld.com/article/139507/2009/03/macpro2009.html

    Having more than one channel has an effect. Whether you can milk every last drop out of it is much more a matter of whether your app can avoid tapping the disk too often than of whether you get 2 or 3 channels going with a set of pipelined requests.
     
  13. VirtualRain macrumors 603

    Joined:
    Aug 1, 2008
    Location:
    Vancouver, BC
    #13
    Deja vu... We had this debate before... :) There are no code optimizations needed to take advantage of different physical memory configurations, just as there aren't any code optimizations required for the OS to take advantage of different disk storage systems (e.g. RAID). Memory and disk storage systems are abstracted up to the OS and applications.
     
  14. trankdart thread starter macrumors member

    Joined:
    Jul 28, 2010
    Location:
    Los Angeles, CA, USA
    #14
    Yes, sorry for the noise, VirtualRain. I saw your earlier thread and understand that you've already attempted to explain it to them. But enough of this, it's hopeless. At least I got my question answered: it won't hurt to use all four DIMM slots. That part was helpful, at least to me. :)

    TD
     
  15. Gonk42 macrumors 6502

    Joined:
    Jan 16, 2008
    Location:
    near Cambridge
    #15
    One curiosity of the Nehalem architecture is that getting full RAM speed (on those chips capable of 1333MHz RAM) required only one DIMM per channel, so adding a fourth would slow the whole system down.

    BUT this wouldn't affect the Mac Pro anyway, as its RAM is restricted to the slower speed, and Westmere probably doesn't have this restriction. Even with Nehalem, manufacturers such as Sun were producing systems that would run at 1333MHz with more than one slot filled per channel.

    Even with the RAM running at the slower speed, the latency is not much different, so the overall change in speed is probably negligible.
     
  16. wonderspark macrumors 68040

    Joined:
    Feb 4, 2010
    Location:
    Oregon
    #16
    Interesting. It's over my head, but I'm glad to hear that the answer remains the same... Four sticks good. :)
     
  17. Loa macrumors 65816

    Joined:
    May 5, 2003
    Location:
    Québec
    #17
    Well you're comparing a lot more than dual and triple channels here!

    Show me a test that shows PS5 working significantly faster using triple channel on a 2009 MP, and I'll do it myself! (Something like 6GB vs 8GB, or 12GB vs 16GB.)

    Loa
     
  18. Phantom Gremlin macrumors regular

    Joined:
    Feb 10, 2010
    Location:
    Tualatin, Oregon
    #18
    Don't confuse latency with bandwidth.

    The reason hyperthreading is usually a net performance improvement is that it works around high latency, i.e. it gives the CPU something else to do after a cache miss. If you are instead out of bandwidth, there is no performance gain from running more concurrent processes (i.e. no point to hyperthreading).

    This is not just simple semantics. This is a very very important distinction. It's (relatively) trivial to increase bandwidth, just add more parallel memory channels like Intel did. It's very hard to reduce latency. That's why recent Intel chips have 3 levels of cache and also have sophisticated cache management algorithms.
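
    A crude way to see the difference (plain C, illustrative only, with arbitrary sizes): a dependent pointer chase is limited by latency, because each load has to finish before the next address is even known, while a straight sequential sum is limited by bandwidth, because the hardware can keep many loads in flight at once.

        /* Illustration only, not a real benchmark: the chase is latency-bound,
           the sum is bandwidth-bound. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N (16 * 1024 * 1024)   /* 16M entries (128 MB), well past the caches */

        int main(void)
        {
            size_t *next = malloc(N * sizeof *next);
            if (!next) return 1;

            /* Build one big random cycle (Sattolo shuffle) so every hop lands on an
               unpredictable cache line. rand() is crude, but fine for a sketch. */
            for (size_t i = 0; i < N; i++)
                next[i] = i;
            for (size_t i = N - 1; i > 0; i--) {
                size_t j = ((size_t)rand() << 16 | (size_t)rand()) % i;
                size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
            }

            /* Latency-bound: each load depends on the previous one, so the CPU
               mostly waits. This is the stall hyperthreading can hide. */
            size_t p = 0;
            for (size_t i = 0; i < N; i++)
                p = next[p];

            /* Bandwidth-bound: independent sequential loads that the controller
               can overlap across however many channels are populated. */
            size_t sum = 0;
            for (size_t i = 0; i < N; i++)
                sum += next[i];

            printf("%zu %zu\n", p, sum);   /* keep the loops from being optimized away */
            free(next);
            return 0;
        }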

    Finally, I agree with the GP poster. Most apps don't come close to saturating the RAM bandwidth. When Intel moved the memory controller off the northbridge and onto the CPU, it really improved the bandwidth (and, by the way, also really improved latency).
     
