nMP & memory performance observations with various mem configs

analog guy · Feb 7, 2014

hey, folks:
just took delivery of a 6c nMP and wanted to post a few observations.

i ran geekbench3 with a number of memory configs and figured i would post the results.

the 16GB chips are crucial RDIMMs.

single-core/multi-core:
3183/4578 (12GB - 3x4 bank #s 1, 2 & 3)
1984/2055 (16GB - 1x16 bank #1)
3096/3916 (32GB - 2 banks filled #s 1 & 3)
3389/5377 (48GB - 3 banks filled #s 1, 2 & 3)
3319/5840 (64GB - all banks filled)

i found it surprising that 3x4 performed far better (single core) than 1x16. i was also surprised that 4x16 took a hit (albeit slight) in single performance over 3x16.

would be curious if someone with a 4x4 configuration tests 4/8/12 & 16.
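
for anyone who wants ratios instead of raw scores, here is a rough python sketch that normalizes the numbers above to the 1x16 result. the scores are copied straight from the list; the dictionary labels and the relative_to helper are just names made up for this illustration.

Code:

```python
# Geekbench 3 memory scores (single-core, multi-core) copied from the list above.
scores = {
    "3x4 (12GB)":  (3183, 4578),
    "1x16 (16GB)": (1984, 2055),
    "2x16 (32GB)": (3096, 3916),
    "3x16 (48GB)": (3389, 5377),
    "4x16 (64GB)": (3319, 5840),
}

def relative_to(baseline, table):
    """Return each config's scores as a multiple of the baseline config's scores."""
    base_single, base_multi = table[baseline]
    return {
        name: (round(single / base_single, 2), round(multi / base_multi, 2))
        for name, (single, multi) in table.items()
    }

ratios = relative_to("1x16 (16GB)", scores)
for name, (single, multi) in ratios.items():
    print(f"{name:12s} single x{single:.2f}  multi x{multi:.2f}")
```

by that math 3x4 comes out around 1.6x (single) and 2.2x (multi) of 1x16, and 4x16 around 1.7x and 2.8x.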

VirtualRain · Feb 7, 2014

analog guy said:
hey, folks:
just took delivery of a 6c nMP and wanted to post a few observations.

i ran geekbench3 with a number of memory configs and figured i would post the results.

the 16GB chips are crucial RDIMMs.

single-core/multi-core:
3183/4578 (12GB - 3x4 bank #s 1, 2 & 3)
1984/2055 (16GB - 1x16 bank #1)
3096/3916 (32GB - 2 banks filled #s 1 & 3)
3389/5377 (48GB - 3 banks filled #s 1, 2 & 3)
3319/5840 (64GB - all banks filled)

i found it surprising that 3x4 performed far better (single core) than 1x16. i was also surprised that 4x16 took a hit (albeit slight) in single performance over 3x16.

would be curious if someone with a 4x4 configuration tests 4/8/12 & 16.

Congrats on receiving your nMP!

It's good to see that this is measurable. I might have guessed that Geekbench would largely fit in cache and thus not adequately stress the memory subsystem, but this clearly says otherwise.

Keep in mind, the quad-channel memory controller interleaves data across channels like a RAID0 array stripes data across drives... so 3x4 should outperform 1x16 handily since the three sticks offer triple the bandwidth of a single stick.

Another factor may also be that RDIMMs add an extra clock cycle of latency vs. UDIMMs.

I'm guessing the anomaly between 3x16 and 4x16 is simply the margin of error in this test.
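
To make the RAID0 comparison concrete, here is a deliberately simplified Python sketch of interleaving. It assumes consecutive 64-byte cache lines are dealt round-robin across the populated channels; real memory controllers use more elaborate address mappings, and the function names and the 3 MiB buffer size are made up for the illustration.

Code:

```python
from collections import Counter

CACHE_LINE = 64  # bytes; assumed interleave granularity, not a documented value

def channel_for(address, populated_channels):
    """Map a physical address to a channel by round-robin on cache-line index."""
    return (address // CACHE_LINE) % populated_channels

def lines_per_channel(buffer_bytes, populated_channels):
    """Count how many cache lines of a sequential buffer land on each channel."""
    counts = Counter(
        channel_for(addr, populated_channels)
        for addr in range(0, buffer_bytes, CACHE_LINE)
    )
    return [counts[ch] for ch in range(populated_channels)]

one = lines_per_channel(3 * 1024 * 1024, 1)    # single DIMM
three = lines_per_channel(3 * 1024 * 1024, 3)  # three DIMMs, one per channel
print(one)
print(three)
```

With one DIMM every line queues on the same channel; with three, a sequential read touches each channel only a third as often, which is where the extra bandwidth comes from.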

MacUser2525 · Feb 7, 2014

analog guy said:
single-core/multi-core:
3183/4578 (12GB - 3x4 bank #s 1, 2 & 3)
1984/2055 (16GB - 1x16 bank #1)
3096/3916 (32GB - 2 banks filled #s 1 & 3)
3389/5377 (48GB - 3 banks filled #s 1, 2 & 3)
3319/5840 (64GB - all banks filled)

i found it surprising

I find the multi-core score you have listed surprising; my brother's Mac mini (i5 model) will beat that. Are you sure you're not missing a 1 in front of them?

VirtualRain · Feb 7, 2014

MacUser2525 said:
I find the multi-core score you have listed surprising; my brother's Mac mini (i5 model) will beat that. Are you sure you're not missing a 1 in front of them?

Those are the memory scores, not the total score.

AidenShaw · Feb 7, 2014

analog guy said:
hey, folks:
just took delivery of a 6c nMP and wanted to post a few observations.

i ran geekbench3 with a number of memory configs and figured i would post the results.

the 16GB chips are crucial RDIMMs.

single-core/multi-core:
3183/4578 (12GB - 3x4 bank #s 1, 2 & 3)
1984/2055 (16GB - 1x16 bank #1)
3096/3916 (32GB - 2 banks filled #s 1 & 3)
3389/5377 (48GB - 3 banks filled #s 1, 2 & 3)
3319/5840 (64GB - all banks filled)

i found it surprising that 3x4 performed far better (single core) than 1x16. i was also surprised that 4x16 took a hit (albeit slight) in single performance over 3x16.

would be curious if someone with a 4x4 configuration tests 4/8/12 & 16.

Could you please post the individual test results as well?

In the other discussion on this topic most of the individual tests were pretty close in performance, but a few greatly benefitted from more channels. It would be interesting to see that data for the full complement of channel populations.

MacUser2525 · Feb 7, 2014

VirtualRain said:
Those are the memory scores, not the total score.

Well that explains it.

sirio76 · Feb 7, 2014

Thanks for your test😉 I would also like to see how things perform in real applications, not only in synthetic benchmarks.
I'm still waiting for my 8-core; in the meantime I've run a couple of tests on one of the nodes I use for distributed rendering in Vray (it's an i7-4930K machine, and the CPU is nearly identical to the Xeon you find in the 6-core nMP). A 10,000,000-polygon scene needs about 10GB to render, and render times were exactly the same with both 16 and 32GB of DDR3-1866 RAM (8x2 and 8x4 DIMMs). Of course I expect that if you fill up all of your RAM with more complex scenes the result will be quite different, but as long as your project fits in memory you should not see a significant decrease in performance, at least in Vray. That's just my experience; there are probably many workloads where running in quad-channel mode will give you some performance gains.
As soon as I get my 8-core I'll test rendering performance with different memory configurations (16GBx1/x2/x3/x4, Crucial DIMMs).

analog guy · Feb 7, 2014

here is the full info from the memory system sub-tests of GB3 (64-bit) for 3x4, 1x16, 2x16, 3x16 and 4x16.

i labeled the files (you should see that if you save them), but the order appears to be what i listed above.

analog guy · Feb 7, 2014

VirtualRain said:
Congrats on receiving your nMP!

It's good to see that this is measurable. I might have guessed that Geekbench would largely fit in cache and thus not adequately stress the memory subsystem, but this clearly says otherwise.

Keep in mind, the quad-channel memory controller interleaves data across channels like a RAID0 array stripes data across drives... so 3x4 should outperform 1x16 handily since the three sticks offer triple the bandwidth of a single stick.

Another factor may also be that RDIMMs add an extra clock cycle of latency vs. UDIMMs.

I'm guessing the anomaly between 3x16 and 4x16 is simply the margin of error in this test.

thanks for that. i was shocked by the magnitude of the difference between 3x4 and 1x16. presumably, 4x4 would provide an even greater difference.

now it makes more sense why the 16GB stock config is not 1x16 or 2x8 (which would make later upgrades easier/more economical for users) -- it's not just a small hit on that subsystem.

CH12671 · Feb 7, 2014

I know you probably don't have the chips to run this test, but I wonder how efficient a 2x8 / 2x4 configuration would be (total = 24 gig). That's how mine will start life, and eventually go to 4x8 for 32 gig....I just don't see a reason to leave 2 slots open and have 3 chips in the drawer while I wait to purchase the remaining 2x8 from crucial....

analog guy · Feb 7, 2014

CH12671 said:
I know you probably don't have the chips to run this test, but I wonder how efficient a 2x8 / 2x4 configuration would be (total = 24 gig). That's how mine will start life, and eventually go to 4x8 for 32 gig....I just don't see a reason to leave 2 slots open and have 3 chips in the drawer while I wait to purchase the remaining 2x8 from crucial....

i don't have 2x8 chips but would love to see your results when you receive your machine.

perhaps i could test 2x16+2x4 = 40GB (or 3x4+1x16=28GB) to see what difference mixing sizes might make.

Umbongo · Feb 7, 2014

analog guy said:
thanks for that. i was shocked by the magnitude of the difference between 3x4 and 1x16. presumably, 4x4 would provide an even greater difference.

now it makes more sense why the 16GB stock config is not 1x16 or 2x8 (which would make later upgrades easier/more economical for users) -- it's not just a small hit on that subsystem.

Well it's just math. 1 DIMM can at most achieve bandwidth of 14.9GB/s, two can achieve twice that and so on. Bandwidth isn't always a key factor in real world performance though and it doesn't look like geekbench is really testing the capacity either so there is that to consider.
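
Spelled out as a quick Python sketch: the 14.9GB/s figure is what you get if you assume DDR3-1866 (1866 million transfers per second) on a 64-bit channel with one DIMM per channel. These are theoretical peaks, not measured numbers, and the function name is just for illustration.

Code:

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bus_width_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s (decimal) for the populated channels."""
    bytes_per_transfer = bus_width_bits // 8  # 64-bit channel moves 8 bytes per transfer
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9

# Assumption: DDR3-1866, one DIMM per channel, up to four channels.
for n in range(1, 5):
    print(f"{n} DIMM(s): {peak_bandwidth_gb_s(1866, channels=n):.1f} GB/s")
```

That works out to roughly 14.9, 29.9, 44.8 and 59.7 GB/s for one to four DIMMs.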

----------

analog guy said:
i don't have 2x8 chips but would love to see your results when you receive your machine.

perhaps i could test 2x16+2x4 = 40GB (or 3x4+1x16=28GB) to see what difference mixing sizes might make.

You can't mix UDIMMs and RDIMMs I'm afraid.

Someone testing 8GB DIMMs could also test 2x8GB+2x4GB and the effects of adding an 8GB DIMM to the other three 4GB ones.

analog guy · Feb 7, 2014

Umbongo said:
Well it's just math. 1 DIMM can at most achieve bandwidth of 14.9GB/s, two can achieve twice that and so on. Bandwidth isn't always a key factor in real world performance though and it doesn't look like geekbench is really testing the capacity either so there is that to consider.

yes--simple math but i hadn't done that calculation before so was surprised by the result.

umbongo said:
You can't mix UDIMMs and RDIMMs I'm afraid.

Someone testing 8GB DIMMs could also test 2x8GB+2x4GB and the effects of adding an 8GB DIMM to the other three 4GB ones.

yes, you're right about the UDIMMs and RDIMMs.

would be interesting to compare 4x8 vs 2x16 as well as the mixed 2x8+2x4 vs 1x8+1x4, 2x8, and 2x4 pairing that CH12671 proposed (hope he/she will test and report back).

Lumpydog · Feb 7, 2014

3096/3916 (32GB - 2 banks filled #s 1 & 3)

Ok - I have 32GB as well but in banks 1 & 2.

Should I be using banks 1 & 3??

analog guy · Feb 7, 2014

Lumpydog said:
3096/3916 (32GB - 2 banks filled #s 1 & 3)

Ok - I have 32GB as well but in banks 1 & 2.

Should I be using banks 1 & 3??

i tried both; performance was virtually identical in geekbench for all memory tests.

CH12671 · Feb 7, 2014

He will definitely report back. My nMP will be "shipped in March." So I should have test results by the middle of April😀

AidenShaw · Feb 7, 2014

analog guy said:
here is the full info from the memory system sub-tests of GB3 (64-bit) for 3x4, 1x16, 2x16, 3x16 and 4x16.

i labeled the files (you should see that if you save them), but the order appears to be what i listed above.

Thanks - but can you post the integer and float numbers as well?

It should be expected that tests designed to bypass the cache and stress the memory system would show significant performance benefits from the added channels.

How does it affect other tasks like JPEG and ZIP compression?

analog guy · Feb 7, 2014

AidenShaw said:
Thanks - but can you post the integer and float numbers as well?

i have all the data from the full test. i can post it later. any other requests so i can gather it in one post?

analog guy · Feb 7, 2014

AidenShaw said:
Thanks - but can you post the integer and float numbers as well?

It should be expected that tests designed to bypass the cache and stress the memory system would show significant performance benefits from the added channels.

How does it affect other tasks like JPEG and ZIP compression?

full results:
12GB (3x4)

16GB (1x16)

32GB (2x16)

48GB (3x16)

64GB (4x16)

Lumpydog · Feb 7, 2014

analog guy said:
i tried both; performance was virtually identical in geekbench for all memory tests.

Awesome - thx!

analog guy · Feb 7, 2014

turns out that my 6-core @ 3,671 (the run i did with 48GB RAM) is the nMP with the highest single-core score for GB3/64-bit.

here are all nMPs who ran the 64-bit test.

the highest-scoring 2013 iMac i7 (run w/ 16GB) scored 13% higher (4,146 to 3,671) -- with 12% higher integer, 9% higher floating and 20% higher memory performance.

AidenShaw · Feb 7, 2014

analog guy said:
full results:
12GB (3x4)

16GB (1x16)

32GB (2x16)

48GB (3x16)

64GB (4x16)

Thank you. Thank you very much.

This looks like the earlier report - the number of DIMMs is almost irrelevant for many of the tests. SHA1 multicore and SHA2 multicore were faster with 1 DIMM than with 4 (but probably within the sampling error - hey 'analog guy', wanna do 20 runs and give us the mean and standard deviation for every component score? 😉 ).

Looking at the group scores:

Code:

                        1 DIMM  2 DIMM  3 DIMM  4 DIMM
                       ------- ------- ------- -------
Floating Point Single    3825    3826    3828    3836
Floating Point Multi    25531   25555   25529   25522
Integer Single           3625    3641    3655    3646
Integer Multi           20959   22686   23768   24282

So,
- virtual 4-way tie on Floating Single
- virtual 4-way tie on Floating Multi-core
- virtual 4-way tie on Integer Single
- 1 DIMM is 86% of 4 DIMMs on Integer multi - but if you removed AES and Dijkstra you'd have a virtual 4-way tie, the rest of the integer multi tests were virtual ties

Those L3 caches do seem to be effective on "non-bandwidth virus" programs.
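
For anyone who wants to reproduce the percentages, here's a small Python sketch that expresses each column of the table above as a percentage of the 4 DIMM score. The numbers are copied from the table; the variable and function names are only illustrative.

Code:

```python
# Group scores from the table above, ordered 1 DIMM, 2 DIMM, 3 DIMM, 4 DIMM.
group_scores = {
    "Floating Point Single": [3825, 3826, 3828, 3836],
    "Floating Point Multi":  [25531, 25555, 25529, 25522],
    "Integer Single":        [3625, 3641, 3655, 3646],
    "Integer Multi":         [20959, 22686, 23768, 24282],
}

def percent_of_full(row):
    """Express every score in a row as a percentage of the last (4 DIMM) score."""
    full = row[-1]
    return [round(100 * score / full, 1) for score in row]

percentages = {name: percent_of_full(row) for name, row in group_scores.items()}
for name, pct in percentages.items():
    print(f"{name:22s}", "  ".join(f"{p:5.1f}%" for p in pct))
```

Only the Integer Multi row moves noticeably; every other row stays within roughly 1% of the 4 DIMM score.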

analog guy · Feb 7, 2014

AidenShaw said:
...

thanks for your analysis.

what do you think of the relevance of the stream copy/scale/add #s where 3x4 outperforms 1x16?

AidenShaw · Feb 7, 2014

analog guy said:
thanks for your analysis.

what do you think of the relevance of the stream copy/scale/add #s where 3x4 outperforms 1x16?

STREAM is a "bandwidth virus" benchmark designed to defeat all caches and measure the raw memory bandwidth of the system. It does nothing useful.

IMO, it is interesting for people trying to get into the Top500 Supercomputer list, but mostly irrelevant for anyone considering an Apple, Windows or Linux desktop system.

Most apps benefit from cache, and Intel is currently looking at 2 MiB to 2.5 MiB cache per physical core as the sweet spot. The GeekBench numbers show that is a good decision for almost all of the tests in GeekBench. There are probably some useful desktop apps that need extreme bandwidth, but not many.

One thing that I was happy to learn from this discussion is that AES encryption is one of the bandwidth intensive apps. I'm buying systems for an application gateway prototype which will use 20-core systems to do SSL (AES) encryption. I've learned that populating each system with 8 DIMMs is the way to go. (Some systems only need 32 GiB, so they'll get 8x4GiB.)

VirtualRain · Feb 8, 2014

analog guy said:
full results:
12GB (3x4)

16GB (1x16)

32GB (2x16)

48GB (3x16)

64GB (4x16)

Thanks for taking the time to do this.

AidenShaw said:
Thank you. Thank you very much.

This looks like the earlier report - the number of DIMMs is almost irrelevant for many of the tests. SHA1 multicore and SHA2 multicore were faster with 1 DIMM than with 4 (but probably within the sampling error - hey 'analog guy', wanna do 20 runs and give us the mean and standard deviation for every component score? 😉 ).

Looking at the group scores:

Code:

                        1 DIMM  2 DIMM  3 DIMM  4 DIMM
                       ------- ------- ------- -------
Floating Point Single    3825    3826    3828    3836
Floating Point Multi    25531   25555   25529   25522
Integer Single           3625    3641    3655    3646
Integer Multi           20959   22686   23768   24282

So,
- virtual 4-way tie on Floating Single
- virtual 4-way tie on Floating Multi-core
- virtual 4-way tie on Integer Single
- 1 DIMM is 86% of 4 DIMMs on Integer multi - but if you removed AES and Dijkstra you'd have a virtual 4-way tie, the rest of the integer multi tests were virtual ties

Those L3 caches do seem to be effective on "non-bandwidth virus" programs.

Moral of this story? Cache is king! 😎
