Searching for "nehalem unbalanced memory" I couldn't find a reference, other than an unsourced post on Apple's site, that suggests each channel having the same capacity makes it balanced again and thus gives you full bandwidth. Maybe it's because people aren't trying it or I'm not searching for the right thing, but there should be some reference on it.

I found the following two links that support my understanding of how adding the 4th DIMM changes things. I can see why 2-2-1/1 might work, but as 2-2-4 gives triple channel, I don't see why 2-2-2/2 would take a 25-30% hit if this were correct.

http://en.community.dell.com/blogs/.../04/08/nehalem-and-memory-configurations.aspx

http://blogs.sun.com/melvinkoh/entry/nehalem_memory_configuration
 
If you want to use maximum RAM capacity you can fill all four slots with 8 GB RDIMM sticks, but bandwidth will drop considerably (33%) because you drop from three memory channels to two.

You keep stating that using 2 channels will drop the bandwidth by 33%. Do you see this in real-life applications? Google for DDR3 in Dual Channel or something like that. You will find lots of benchmarks that will show you that the difference between dual channel and triple channel configs in real-life apps is negligible, 2-3%. RAM latency and speed play a more significant role though. Hence I was very keen on your efforts to use 1333 RAM.

Here are some examples:

http://www.techreport.com/articles.x/15967/1
http://www.overclock.net/intel-memory/419118-ddr3-dual-channel-vs-triple.html

There is a very good benchmark here too, but you will have to take my word on it. I bet you can only read the charts :)

http://www.ixbt.com/cpu/intel-ci7-mem.shtml
 
I must admit that the bandwidth drop for asymmetric population is hearsay on my part. It could be less than 30%, but it is certainly not negligible, due to the massive timing imbalance between reading channels 1/2 and 3. The memory controller always drops to the slowest readout. If it takes x nanoseconds to read a cycle from ch3, it will slow ch1 and ch2 down to that speed as well, because it runs all channels in sync.
 
What I do think might be helpful is if we were to compile some sort of list of the best memory module configurations for the Mac Pro that can just be pasted into threads of this kind.

Seconded. Something easy for us designer/non-techies to understand. ;)

Like (NOT THE REAL LIST - JUST EXAMPLES)

GOOD SETUP
1-1-1-0
2-2-2-0
4-4-2-2
2-2-1-1

BAD SETUP
2-2-2-2 (unbalanced)
1-1-1-4
4-4-1-1

etc, etc.
Maybe give reasons why things are in the "bad setup" column.

Thanks in advance. If researched correctly, this is something that could go in the RAM wiki...
 
I must admit that the bandwidth drop for asymmetric population is hearsay on my part. It could be less than 30%, but it is certainly not negligible, due to the massive timing imbalance between reading channels 1/2 and 3. The memory controller always drops to the slowest readout. If it takes x nanoseconds to read a cycle from ch3, it will slow ch1 and ch2 down to that speed as well, because it runs all channels in sync.

As a matter of fact, latency on a dual channel config is much better than on a triple channel config with similar RAM modules. But let me step back for a sec...

First of all, check out the attached test results for a real-life app (I don't know how to embed an image, sorry). A triple channel config makes almost no impact in Call of Duty 4. There are many similar tests out there with real-life apps, just Google for them.

Second, the read is also not "sequential" as I understand it, but "parallel". You have 8.5GB/s per channel in single-channel mode with 1066MHz. Triple channel is supposed to give you 25.5GB/s. At least in theory. So, in essence, you were right, the lack of the third channel would be a 33% degradation. Again, in theory. If you look at synthetic benchmarks, you can see a performance difference between dual and triple channel of roughly 10-20%. In favor of triple channel, of course. With real-life apps - see above. Not much of a difference.
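For anyone who wants to sanity-check those theoretical numbers, here is the back-of-the-envelope arithmetic as a tiny C sketch (theoretical peaks only, and rounding explains the 25.5 vs. 25.6; real apps won't reach these figures):

```c
/* Theoretical peak bandwidth for DDR3-1066 channels. */
#include <stdio.h>

int main(void)
{
    const double transfers_per_sec = 1066e6;  /* DDR3-1066: ~1066 million transfers/s */
    const double bytes_per_transfer = 8.0;    /* each channel is 64 bits (8 bytes) wide */
    double per_channel = transfers_per_sec * bytes_per_transfer / 1e9;

    printf("one channel : %.1f GB/s\n", per_channel);             /* ~8.5  */
    printf("dual channel: %.1f GB/s\n", 2.0 * per_channel);       /* ~17.1 */
    printf("tri channel : %.1f GB/s\n", 3.0 * per_channel);       /* ~25.6 */
    printf("3 -> 2 drop : %.0f%%\n", (1.0 - 2.0 / 3.0) * 100.0);  /* ~33%  */
    return 0;
}
```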

Moreover, as tests show, the latency with triple channel is worse than with dual channel. Significantly worse.

Look at the last chart (link below); it shows about a 10% latency degradation when a third module is added. Why? I have no clue, to be honest with you.

http://www.techreport.com/articles.x/15967/3

What would be great though is if you could run some tests first-hand - you seem to have the most advanced config here and very good access to spare parts. It would be interesting to see if you notice any performance differences with the apps you use.
 

Attachment: CofD4.bmp (Call of Duty 4 benchmark chart)
I don't believe that games are bandwidth-sensitive apps at all. Typically you depend on the GPU and, to a lesser degree, on the CPU, but not on memory speed.

http://www.advancedclustering.com/company-blog/stream-benchmarking.html

[Image: stream-results.png - memory bandwidth comparison chart from the linked article]


This article and picture show how bandwidth compares in different systems. You can see that compared to a 2008 MP (Xeon 5400), the Nehalem (Xeon 5500 @ 1066MHz) has 360% of the memory speed. So unless your old system was bandwidth-constrained, most apps will run nicely at full speed on 66% of the available bandwidth. Hey, this is still 240% of what the old 2008 MP managed. With that in mind I would happily fit 4 identical DIMMs unless I happen to run a simulation of the entire North Atlantic weather system.

To test memory you should run typical memory benchmarks like the STREAM test mentioned in that article, or apps that are known memory performance hogs, like video rendering. We certainly have experts here who can identify tasks that push memory to its limits much better than I can. Typically those would be high-performance computing jobs that run all cores at capacity. Computational fluid dynamics could be such a job, or other scientific and engineering computations with finite elements, like deformation and crash behavior.
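If anyone wants to roll their own quick test, the heart of a STREAM-style measurement is only a few lines. Here is a minimal single-threaded sketch of the "triad" kernel - not the official STREAM code, and one thread won't saturate triple channel, but it shows what the benchmark actually measures:

```c
/* Minimal STREAM-style "triad" sketch: a[i] = b[i] + s*c[i].
 * Arrays are sized well past the L3 cache so the loop hits main memory. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (64 * 1024 * 1024)   /* 64M doubles = 512 MB per array */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    double *b = malloc((size_t)N * sizeof *b);
    double *c = malloc((size_t)N * sizeof *c);
    if (!a || !b || !c) { puts("out of memory"); return 1; }

    /* Touch every element once so page faults happen outside the timed loop. */
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = now();
    for (long i = 0; i < N; i++)      /* triad: a = b + s*c */
        a[i] = b[i] + 3.0 * c[i];
    double t1 = now();

    /* Each iteration moves three doubles to/from DRAM: two reads and one write. */
    double gbytes = 3.0 * (double)N * sizeof(double) / 1e9;
    printf("triad: %.2f GB/s\n", gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```

Compile it with optimization (e.g. cc -O2) and keep the arrays far larger than the 8MB L3 so you are really timing main memory; the official STREAM source adds threading (OpenMP) and repeats the kernels several times to get stable, saturated numbers.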

I was more interested in unlocking the artificial limits that Apple has built into my system via crappy firmware. Unfortunately there is no way to change anything there unless you know Apple firmware, which I certainly do not.
 
This article and picture show how bandwidth compares in different systems. You can see that compared to a 2008 MP (Xeon 5400), the Nehalem (Xeon 5500 @ 1066MHz) has 360% of the memory speed. So unless your old system was bandwidth-constrained, most apps will run nicely at full speed on 66% of the available bandwidth. Hey, this is still 240% of what the old 2008 MP managed. With that in mind I would happily fit 4 identical DIMMs unless I happen to run a simulation of the entire North Atlantic weather system.
Do you happen to know if the 5400 series was tested in single or dual channel mode?

I was more interested in unlocking the artificial limits that Apple has built into my system via crappy firmware. Unfortunately there is no way to change anything there unless you know Apple firmware, which I certainly do not.
No one outside of Apple knows Apple's EFI firmware. If anything works, it's because it was developed around the standard EFI 1.10 spec, and it didn't have problems with the code sections that Apple customized. It might work to make a util to access the settings (some being more likely than others IMO), but it's a "shot in the dark" approach without the full details.

Otherwise, it would require Reverse Engineering Apple's firmware. Since it's their IP, anyone who posted something that used it (free or otherwise), would almost certainly be nailed for IP infringement.
 
Do you happen to know if the 5400 series was tested in single or dual channel mode?

[Image: architectures.xeon5400.png - Xeon 5400 architecture diagram]


The Xeon 5400 "Harpertown" processor is essentially two dual-core processors on one physical CPU package. Each physical CPU shares a connection to the front side bus. There is one memory controller per system and it is part of the system's chipset, called the memory controller hub or MCH. The MCH provides all of the physical access to the system's 667MHz or 800MHz fully buffered DIMMs (FB-DIMMs). This shared memory controller and FSB create a bottleneck in memory bandwidth performance.

The text and the pic make it look like the MCH runs two channels per CPU with two DIMM slots per channel. But it also states that this isn't the deciding bottleneck. That is obviously the MCH part of the chipset. I believe two chipsets were used with the Harpertown, the 5100 (667 MHz) and the 5400 (800-1600MHz) chipset. The 5100 MCH has one channel per CPU and is obviously not pictured or measured here. So the answer is that the figures are for dual channel memory. The 5400 was also the MCH used by Apple, because they consistently used 800 MHz RAM.

http://ark.intel.com/Compare.aspx?ids=34474,33142,

Intel® 5400 Memory Controller vs. Intel® 5100 Memory Controller:

  • Supported FSBs: 1600MHz / 1333MHz / 1066MHz / 800MHz vs. 1333MHz / 1066MHz / 667MHz
  • Max mem. size (dep. on mem. type): 128 GB vs. 48 GB
  • Memory types: DDR2 FB-DIMM vs. DDR2-533 / DDR2-667 registered ECC DIMMs
  • # of memory channels: 4 vs. 2
 
The text and the pic make it look like the MCH runs two channels per CPU with two DIMM slots per channel.
I know it can run as a dual channel system (some might call 2x dual channel a quad, but that's not really accurate, as quad channel operation isn't possible to a single CPU; only dual). But even for dual channel, the numbers seemed a bit low is all.

But it also states that this isn't the deciding bottleneck. That is obviously the MCH part of the chipset.
It's the interaction between the MCH and FSB. Personally, I've always looked at the MCH as having more responsibility, as it has to take the memory, its own PCIe lanes, and the IOH and route it all through the FSB channel/s (depending on SP or DP system, as there's only one FSB per processor).

I believe two chipsets were used with the Harpertown, the 5100 (667 MHz) and the 5400 (800-1600MHz) chipset. The 5100 MCH has one channel per CPU and is obviously not pictured or measured here. So the answer is that the figures are for dual channel memory. The 5400 was also the MCH used by Apple, because they consistently used 800 MHz RAM.
IIRC, yes it was possible, and used on some boards to save on costs (the 5100 would have been cheaper to use, particularly if the board was intended to run with the 1333MHz FSB Harpertown parts).
 
I know it can run as a dual channel system (some might call 2x dual channel a quad, but that's not really accurate, as quad channel operation isn't possible to a single CPU; only dual). But even for dual channel, the numbers seemed a bit low is all.

As they didn't offer up a test system I'm guessing they got the results elsewhere. The numbers look like 8 DIMMs @ 800MHz and 4 DIMMs @ 667MHz.
 
Just to wade in here... :p

I don't think there's any conclusive evidence indicating how the IMC handles four populated memory slots. It would be nice if Intel clearly articulated how it works. It would also be nice if 4-4-2-2 resulted in tri-channel operation, but I recall reading in the Intel docs that four DIMMs = dual channel.

However, I think it's moot; as KG2002 points out, there are few apps, including most OS X benchmarks, that can saturate tri-channel memory bandwidth (find one if you can). This is almost entirely due to the fact that the L3 cache sizes are so (relatively) large as to completely mask any memory bottlenecks in most apps.

Oddly, one of the benefits of Intel's move to an IMC and tri-channel was that they could spend less silicon real estate on cache without impacting performance (and thus decrease die size or dedicate more transistors to core logic), but alas, they have not exploited this benefit and still run extreme L3 cache sizes by any measure.
 
As they didn't offer up a test system I'm guessing they got the results elsewhere. The numbers look like 8 DIMMs @ 800MHz and 4 DIMMs @ 667MHz.

No,

the 5400 @ 667 is the 5100 chipset with ? DIMMs
the 5400 @ 800 is the 5400 chipset with 8 DIMMs

at least that is how I read it.
 
No,

the 5400 @ 667 is the 5100 chipset with ? DIMMs
the 5400 @ 800 is the 5400 chipset with 8 DIMMs

at least that is how I read it.

There are 1333MHz 5400 processors that are limited to 667MHz memory; the chipset supports 533, 667 and 800MHz.
 
As they didn't offer up a test system I'm guessing they got the results elsewhere. The numbers look like 8 DIMMs @ 800MHz and 4 DIMMs @ 667MHz.
Quite possible they didn't do the testing themselves.

My point is, the numbers shown for the Harpertowns just look really low. I know DDR3 in triple channel is faster, but not that significant a difference between the two architectures (3x+).

I seem to recall it's more along the lines of ~22 - 24GB/s for the 800MHz memory, and ~18 - 21GB/s for 667MHz DDR2-FB-DIMM though (my numbers may be slightly off, but not that drastically). Assuming I've not totally lost my mind... :p

Just to wade in here... :p

I don't think there's any conclusive evidence indicating how the IMC handles four populated memory slots. It would be nice if Intel clearly articulated how it works. It would also be nice if 4-4-2-2 resulted in tri-channel operation, but I recall reading in the Intel docs that four DIMMs = dual channel.
They may have avoided that issue to some extent due to the possibility of moving data off one IMC to the other CPU via QPI. That's rather slow, as the pair of QPIs are 8x PCIe Gen 2.0 lanes total (4x lanes each, but combined = 4GB/s max).

But for SP systems (or DP systems using the entire IMC per CPU; nothing across QPI), it is the case. I do recall reading that in one of the docs.

Oddly, one of the benefits of Intel's move to an IMC and tri-channel was that they could spend less silicon real estate on cache without impacting performance (and thus decrease die size or dedicate more transistors to core logic), but alas, they have not exploited this benefit and still run extreme L3 cache sizes by any measure.
I don't see this as odd at all, but the goal. They allotted more transistors to the controller (the savings from the IMC and cache went to the new features). Now the die shrink to 32nm allows for some additional features, and still requires a smaller area (to produce a higher yield per wafer). Rather smart actually.
 
I don't see this as odd at all, but the goal. They allotted more transistors to the controller (the savings from the IMC and cache went to the new features). Now the die shrink to 32nm allows for some additional features, and still requires a smaller area (to produce a higher yield per wafer). Rather smart actually.

Intel was using 2MB/core cache with the C2D FSB architecture to compensate for the constrained memory bandwidth available with that architecture... such an enormous cache really gave the architecture much better performance than it deserved, giving it longevity well past its best-before date. Now, with the move to an on-die memory controller (and thus eliminating a serious bottleneck), they still continue to use 2MB/core cache, which is really unnecessary now. To use a water reservoir analogy, it's like increasing the capacity of the inlet to the reservoir but still having the reservoir so large that it's almost irrelevant. Intel could easily move to 1.5MB/core or even 1MB/core without impacting data access and use that massive number of transistors for higher-level logic (or reduce die size and TDP, perhaps enabling higher clocks).

Keep in mind that AMD's architecture has been using on-die memory controllers since the X2 days (5+ years?) with much smaller cache sizes and they've always been regarded as having superior memory performance for the given technology (DDR/DDR2/etc.).

Intel's move to tri-channel DDR3 really offers little real-world benefit over even dual-channel DDR2 on a FSB because the cache sizes on their old processors were over the top. Yet they haven't reduced them at all with this new architecture. :confused:
 
I don't see a reason to doubt the bandwidth figures from that article for the 5500. You find those figures in other publications from IBM, Dell and Sun that looked into memory optimization. At the same RAM frequency the 5500 is 2.6 times faster than the 5400.

Obviously the memory system was also designed for the 5600 six-core CPU (although in Gulftown apparently with x12) and just looks a bit overblown on a quad. One should not forget that this CPU is primarily designed for servers and that they stuff virtual machines to the hilt on these babies with 144 GB of RAM @ 800 MHz. So even at a lower frequency there must be a good reserve to handle those enormous data streams. The figures officially used by Intel are not substantially different. Memory bandwidth has always been the main argument for the Nehalem architecture. Perhaps people here have never looked into it because few would really use the 5500 as a server. For me the takeaway from this discussion is clear.

  • Apple stole 20% of the bandwidth of my advanced CPU by locking the multiplier at 8 instead of 10.
  • Asymmetric memory population on the MP4,1 should be of little concern because bandwidth is rather generous to start with.
  • Apple is systematically forcing people towards octads and higher DIMM density due to the failure to provide 6 RAM slots/socket.

Due to my tests I believe that combining DIMMs for symmetrical memory allocation to the channels does work. Other researchers simply do not focus on that setup because they do not have the rather asinine Apple design. I would encourage those who doubt my results to make their own tests. It is not so difficult. Of course I will bow to adverse results if someone uses a better and more scientific memory measurement method than just Geekbench numbers.
 
Intel was using 2MB/core cache with the C2D FSB architecture to compensate for the constrained memory bandwidth available with that architecture... such an enormous cache really gave the architecture much better performance than it deserved, giving it longevity well past its best-before date. Now, with the move to an on-die memory controller (and thus eliminating a serious bottleneck), they still continue to use 2MB/core cache, which is really unnecessary now. To use a water reservoir analogy, it's like increasing the capacity of the inlet to the reservoir but still having the reservoir so large that it's almost irrelevant. Intel could easily move to 1.5MB/core or even 1MB/core without impacting data access and use that massive number of transistors for higher-level logic (or reduce die size and TDP, perhaps enabling higher clocks).
I understand what you're getting at. But as you noticed, it did give the architecture additional lifespan (carried over from year to year). It saved them on development costs for totally new designs.

To me, this is part of the reasoning for keeping the cache level the same. But it also helps make up for the additional latency in DDR3. Either way, you keep the core running rather than waiting for data, which is critical in high performance server systems (some workstation use as well). Now ATM, the cache ratio may be a little high, but I'd think it will be utilized better in the future parts (necessary via the controller design as it's already implemented).

Keep in mind that AMD's architecture has been using on-die memory controllers since the X2 days (5+ years?) with much smaller cache sizes and they've always been regarded as having superior memory performance for the given technology (DDR/DDR2/etc.).
Yes, they have. Intel carried their old FSB for quite some time, as they plan that long term from what I understand. Just look how long Core 2 survived, and it's not totally dead. The cores themselves are the same in Nehalem. They only dumped the FSB and related chipsets (made changes, though significant ones). And the LGA1156 parts retain DMI as well.

Intel's move to tri-channel DDR3 really offers little real-world benefit over even dual-channel DDR2 on a FSB because the cache sizes on their old processors were over the top. Yet they haven't reduced them at all with this new architecture. :confused:
There's a little software that can utilize it, but not much. Of what I have direct experience with, some simulation software can. I'd think more commonly VMs could as well, or SMP-based applications. Essentially all of it is on the enterprise side. For consumer software (home users) though, no. It's overkill. But as the enterprise gear trickles down into consumer parts, it's inevitable that such parts will obtain the new tech. Now users need software to catch up. As usual. :eek: :p

They've built the architecture to carry them awhile again. So there's plenty of headroom to improve the performance to the limit (they don't give us that the 1st time around). :rolleyes: :p We'll see more cores per CPU and higher memory bandwidths over the parts to come, just as with architectures released before it.

I don't see a reason to doubt the bandwidth figures from that article for the 5500. You find those figures in other publications from IBM, Dell and Sun that looked into memory optimization. At the same RAM frequency the 5500 is 2.6 times faster than the 5400.
It's just not matching what I seem to recall it being capable of, so I'll have to do some looking to make sure I've not gotten something crossed.

  • Apple stole 20% of the bandwidth of my advanced CPU by locking the multiplier at 8 instead of 10.
  • Apple is systematically forcing people towards octads and higher DIMM density due to the failure to provide 6 RAM slots/socket.
Apple definitely raked users over the coals here. There wasn't really any need IMO to make the memory multiplier fixed rather than use SPD as every other system does (EFI doesn't make SPD unavailable). So either it was a ROM capacity issue, or they intentionally bricked the system. They definitely made the DIMM slot issues intentional, as the slots could have been arranged differently to allow for a couple more per CPU (6x per would have been sufficient). But it would have limited sales of the DP systems, and that would make the most sense for doing it.
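To put rough numbers on the multiplier point (assuming the usual 133MHz Nehalem base clock and the x8 vs. x10 memory multipliers mentioned earlier in the thread):

    133MHz x 8  ≈ 1066MHz  (what the locked firmware runs the memory at)
    133MHz x 10 ≈ 1333MHz  (what 1333-capable DIMMs and CPUs could run)
    1066 / 1333 ≈ 0.80     -> roughly 20% of the peak memory clock, and with it peak bandwidth, given up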
 
You are recalling the FB-DIMM memory bandwidth wrong, nanofrog. They never came close to the theoretical quad channel bandwidth numbers.
 
Just look how long Core 2 survived, and it's not totally dead. The cores themselves are the same in Nehalem. They only dumped the FSB and related chipsets (made changes, though significant ones).

I do have my doubts that features like hyper-threading, turbo boost and energy gates were part of the Core 2 architecture. Or have they been disabled? My understanding is that Nehalem really was a step forward to address server customers' issues. Intel claims that the Nehalem servers pay for themselves in as little as 7 months for heavy users due to energy and space density advantages.


Apple definitely raked users over the coals here. There wasn't really any need IMO to make the memory multiplier fixed rather than use SPD as every other system does (EFI doesn't make SPD unavailable). So either it was a ROM capacity issue, or they intentionally bricked the system. They definitely made the DIMM slot issues intentional, as the slots could have been arranged differently to allow for a couple more per CPU (6x per would have been sufficient). But it would have limited sales of the DP systems, and that would make the most sense for doing it.

If you read the manual that you referred me to, it says that the machine has one DIMM slot per core. That is a very superficial way to look at things. They had four cores and four DIMMs as well in the MP3,1. So instead of going through the exercise of finding the PCB area for two more slots through a more efficient design of the tray and the heat sinks, they decided that four DIMMs were still good enough for what a MP needs to do. I can imagine that nobody really spent the effort to understand memory placement in the Nehalems. It is certainly not as well thought through as on a Sun, Dell or IBM server, when you read the stuff that bloggers post about those systems.
 
You are recalling the FB-DIMM memory bandwidth wrong, nanofrog. They never came close to the theoretical quad channel bandwidth numbers.
I couldn't recall if the numbers running around in my head were real or theoretical. That's why I worded it the way I did (the testing I did was some time ago; the most recent is more than a year old).

If you have the actual data (for comparative purposes), I'd appreciate it. :)

I do have my doubts that features like hyper-threading, turbo boost and energy gates were part of the Core 2 architecture. Or have they been disabled? My understanding is that Nehalem really was a step forward to address server customers' issues. Intel claims that the Nehalem servers pay for themselves in as little as 7 months for heavy users due to energy and space density advantages.
Those are new features. The cores themselves were used, but the controller was changed extensively (the new features you listed and a few others), the IMC was added (replacing the FSB), and DMI was dumped for QPI (LGA1366 parts only; LGA1156 retains DMI).

I still see it as systems though. Nehalem borrowed a bit from Core 2, and changed out the aspects causing bottlenecks (features added as well, such as VT-d to enhance virtualization, which is in high demand).

Those changes are aimed directly at the enterprise market as you indicate. It's where the real money is for Intel, so they take it seriously IMO.

If you read the manual that you referred me to, it says that the machine has one DIMM slot per core. That is a very superficial way to look at things. They had four cores and four DIMMs as well in the MP3,1. So instead of going through the exercise of finding the PCB area for two more slots through a more efficient design of the tray and the heat sinks, they decided that four DIMMs were still good enough for what a MP needs to do. I can imagine that nobody really spent the effort to understand memory placement in the Nehalems. It is certainly not as well thought through as on a Sun, Dell or IBM server, when you read the stuff that bloggers post about those systems.
I have read it. They seem to only attempt a simple explanation pertinent to the MP. Now whether or not they truly understood it is hard to say, but I'd think they did. Also keep in mind, the actual board work is done by Intel as a custom set of parts for Apple.

I do think the system could have been arranged to accommodate another pair of DIMM slots (2x per channel, as most other boards managed). Had it been a single board for a DP, then there would have been a strong argument that it wouldn't fit. Just not enough surface area. But risers or daughter boards solve the surface area issue, as you can get an approximation of an E-ATX/SSI EEB area when you add it up.

So the limitations do seem intentional to me, in order to push the DP systems on the basis that you can use less expensive memory densities to achieve a capacity acceptable for many users (up to 16GB per CPU via 4GB sticks). Beyond that is possible, but not cheap, and it doesn't seem common for MP users, judging from the capacities most list as running.
 