
bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
Subject: Mixing DIMM sizes in the MP7,1

My current approach for setting up 288 GB RAM is to add 8x 32GB R-DIMMs to my stock 4x 8GB R-DIMMs.

After reading the article here -> https://macperformanceguide.com/MacPro2019-MemoryBandwidth.html I'm a bit concerned my approach is not good wrt memory performance.

Our office performs two major tasks:

  1. Film editing using primarily Adobe software (but we will be exploring FCPX, as it appears to offer significant benefits on the MP7,1).
  2. Fluid dynamics (CFD) simulations - multi-threaded, and CPU-, core-, and memory-intensive, with heavy I/O at times.

My concern is with the 2nd CFD task.
 

chfilm

macrumors 68040
Nov 15, 2012
3,137
1,871
Berlin
Subject: Mixing DIMM sizes in the MP7,1

My current approach for setting up 288 GB RAM is to add 8x 32GB R-DIMMs to my stock 4x 8GB R-DIMMs.

After reading the article here -> https://macperformanceguide.com/MacPro2019-MemoryBandwidth.html I'm a bit concerned my approach is not good wrt memory performance.

Our office performs two major tasks:

  1. Film editing using primarily Adobe software (but will be exploring the use of FCPX as it appears to offer extreme benefits with MP7,1).
  2. Fluid dynamics (CFD) simulations - multi-threaded, and CPU-, core-, and memory-intensive, with heavy I/O at times.

My concern is with the 2nd CFD task.

All I can say is, I got 6x 32GB from Samsung and tried to mix them with my pre-installed 4x 8GB DIMMs.
It actually worked in an officially unsupported 10-DIMM configuration, but Geekbench showed dramatically reduced multi-core speed. So I decided to remove the 8GB DIMMs again.

I think the only valid way to do this would be to get 6x 32GB plus two additional 8GB DIMMs, so that each of the two memory controllers gets six DIMMs of the same size.
 

bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
My main concern is for the CFD code we/I run on the MP7,1 with this mix of R-DIMM sizes. The multi-threaded code is very memory intensive.

To solve some of the most complex CFD problems the code is set up to run across 28 cores (our MP7,1 will be a 16-core model). Each core will require 10 GB of RAM. Thus, when this CFD code runs there will be some 8GB of RAM left for other needed system processes that we have no control over.

Each core 'worker-bee' attempts to grab a portion of its 10GB of RAM and keep it in the processor cache (L1 cache is 32K x16 and L2/L3 cache is 38 MB according to the specs I've seen) to chomp on, and when finished moves on to another chunk of its RAM to continue solving its assigned task.

At this time it's unclear to me if the current 4x 8GB R-DIMMs mixed in with 8x 32GB R-DIMMs will be less efficient for the CFD code, as I've just explained, than populating all DIMM slots with 32GB R-DIMMs.

If in doubt I could spend the extra $557 and purchase 4 more 32GB R-DIMMs and place the 4x 8GB R-DIMMs on the shelf. This would then provide an even distribution of same size R-DIMMs and allow each core to command 13GB RAM(memory) that in itself allows for more complex/bigger CFD problems to be solved.
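For what it's worth, the headroom arithmetic above can be sketched in a few lines (the 28 workers, 10GB-per-worker figure, and DIMM counts are from this post; treating GB as whole units is a simplification):

```python
# Back-of-envelope memory budget for the CFD run described above.
workers = 28            # MPI worker threads
per_worker_gb = 10      # RAM each worker claims
solver_gb = workers * per_worker_gb        # 280 GB for the solver

mixed_gb = 8 * 32 + 4 * 8                  # 8x 32GB + 4x 8GB = 288 GB
uniform_gb = 12 * 32                       # 12x 32GB         = 384 GB

print(mixed_gb - solver_gb)    # headroom on the mixed layout: 8 GB
print(uniform_gb - solver_gb)  # headroom with all 32GB DIMMs: 104 GB
```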
 

AidenShaw

macrumors P6
Feb 8, 2003
18,667
4,675
The Peninsula
Each core 'worker-bee' attempts to grab a portion of its 10GB RAM(memory) and keep it in the processor cache
If this is the case, you'll probably see no measurable effect from the memory mismatch - it may be slightly slower pulling RAM into the cache, but most of the time it will run at full cache speed.

I hope that the code is smart and senses the cache size on the current processor. If it's a preset value it may run poorly on processors with smaller caches, and not optimally on processors with larger caches.

Does it use processor and NUMA affinity?
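Sensing the cache and sizing each worker's chunk to it, as suggested here, might look like this in outline (a hypothetical sketch: the half-of-cache rule of thumb and the 1 MB example are assumptions for illustration, not anything measured on the MP7,1):

```python
def block_elements(cache_bytes, elem_size=8, fill_fraction=0.5):
    """Elements per work block so the block fits in part of the cache.

    fill_fraction < 1 leaves cache room for other data; the 0.5
    default is an assumed rule of thumb, not a tuned value.
    """
    return int(cache_bytes * fill_fraction) // elem_size

# Example: an assumed 1 MB per-core L2 slice, 8-byte doubles.
print(block_elements(1024 * 1024))  # 65536 elements per block
```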
 

defjam

macrumors 6502a
Sep 15, 2019
795
734
If in doubt I could spend the extra $557 and purchase 4 more 32GB R-DIMMs and place the 4x 8GB R-DIMMs on the shelf. This would then provide an even distribution of same size R-DIMMs and allow each core to command 13GB RAM(memory) that in itself allows for more complex/bigger CFD problems to be solved.
I'd just buy the extra memory, $557 for 128GB is not unreasonable.
 

bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
The CFD code employs the Message Passing Interface (MPI) and each worker-bee shares its results to the other worker-bees at times.

When I first started using this CFD code it was on an SGI system and yes, NUMA was employed, as the memory was distributed on the SGI system. The SGI system was great fun to work with.

A ref: https://en.m.wikipedia.org/wiki/Message_Passing_Interface
I'd just buy the extra memory, $557 for 128GB is not unreasonable.
I agree... and I'm seriously thinking of this approach.... However, I really would like to know if I should be worrying about poor memory access times with my original approach of using mixed R-DIMM sizes.
 

defjam

macrumors 6502a
Sep 15, 2019
795
734
I agree... and I'm seriously thinking of this approach.... However, I really would like to know if I should be worrying about poor memory access times with my original approach of using mixed R-DIMM sizes.
Generally speaking, the way to maximize memory bandwidth is to use the same memory configuration across memory controllers. Typically I'd say the optimal memory configuration isn't as important as having enough memory for your workload. However, you've indicated CFD is sensitive to memory bandwidth, so my recommendation is to spend the extra $557 to ensure all memory is identical and optimal memory bandwidth is achieved.
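The "same configuration across controllers" rule can be written down as a tiny check (a sketch only: the slots-0-5 / slots-6-11 controller split is an assumed mapping, and real balance rules are more detailed than this):

```python
def is_balanced(dimms_gb):
    """True if both 6-slot controllers see the same uniform DIMM set.

    dimms_gb: 12 slot sizes in GB (0 = empty); slots 0-5 are assumed
    to belong to one controller and 6-11 to the other.
    """
    a, b = sorted(dimms_gb[:6]), sorted(dimms_gb[6:])
    sizes = {s for s in dimms_gb if s}
    return a == b and len(sizes) <= 1      # mirrored, single DIMM size

print(is_balanced([32] * 12))              # True:  12x 32GB
print(is_balanced([32] * 8 + [8] * 4))     # False: 8x 32GB + 4x 8GB mix
```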
 

erroneous

macrumors member
Jul 25, 2004
63
29
I now have 6x 32Gbyte 3rd-party RDIMMs and my original 4x 8Gbyte RDIMMs in my 16-core MacPro7,1. Below are Geekbench 5 (5.1.0) runs from some memory layouts I set up out of curiosity.

RDIMMs installed        | Geekbench 5.1.0 test output                    | Multi-Core "AES-XTS" result
12x 32Gbyte             | https://browser.geekbench.com/v5/cpu/1079648   | 31.8 GB/sec
6x 32Gbyte              | https://browser.geekbench.com/v5/cpu/949425    | 30.2 GB/sec
6x 32Gbyte + 4x 8Gbyte  | https://browser.geekbench.com/v5/cpu/949355    | 18.2 GB/sec
4x 8Gbyte               | https://browser.geekbench.com/v5/cpu/949048    | 14.4 GB/sec
2x 32Gbyte + 4x 8Gbyte  | https://browser.geekbench.com/v5/cpu/948942    | 16.8 GB/sec

If you look at the multi-core tests you'll see several subtests that are clearly very sensitive to memory bandwidth - "AES-XTS", "Navigation", "Machine Learning" and a few others. Do any of them resemble parts of your CFD workload?

I found a 2017 Lenovo testing document about memory performance on 6-channel Scalable Xeons in various balanced and unbalanced configurations, which seems to explain what's going on quite well. The memory throughput ratios in the document roughly match the changes in Geekbench subtest throughput.

The 10-channel layout showed all green in the About This Mac -> Memory tab, even though Apple's memory support website states that a 10-slot layout is not supported.

Hope that helps. :)
 


bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
Thank you... :) ?

From the Lenovo testing doc it seems that for my required RAM capacity of 288GB I should divide by 12, round up to the next standard DIMM size, and populate all 12 slots with that size.

Thus I have 4x 8GB R-DIMMs as stock and have purchased 8x 32GB from a 3rd party, giving me my 288GB of RAM. However, 288/12 = 24, and rounding up to 32 means I should populate all 12 slots with 32GB R-DIMMs to obtain a Balanced memory configuration and a relative performance of 100%.

Thus I will now strongly consider removing the stock 4x 8GB R-DIMMs and purchasing a further 4x 32GB, allowing me to populate all 12 slots with 32GB R-DIMMs. This provides 384GB of RAM - more than the 288GB I require - but with the best memory performance.

The document does show that for a required 288GB capacity I could use 6x 32GB and 6x 16GB R-DIMMs as an alternative, obtaining Near-Balanced memory performance of around 97%.

Given that I've already purchased 8x 32GB R-DIMMs, it makes more sense to buy 4 more 32GB R-DIMMs for around $557 than 6x 16GB R-DIMMs for around $530, which would require me to waste/sell/shelve two of my already-purchased 32GB R-DIMMs.

Given all this I will buy the extra 4x 32GB R-DIMMs and have a RAM capacity of 384GB, which will actually allow me to solve larger and more complex CFD problems, with the added benefit of 100% (Balanced) memory performance.
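The divide-by-12-and-round-up rule works out like this (a sketch; the list of standard R-DIMM sizes is an assumption about what's commonly available):

```python
import math

STANDARD_GB = (8, 16, 32, 64)   # assumed available R-DIMM sizes

def uniform_dimm_gb(required_gb, slots=12):
    """Smallest standard DIMM size that meets required_gb with all slots matching."""
    per_slot = math.ceil(required_gb / slots)            # e.g. 288/12 = 24
    return next(s for s in STANDARD_GB if s >= per_slot)

size = uniform_dimm_gb(288)
print(size, size * 12)   # 32 384 -> 12x 32GB for 384 GB total
```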
 

defjam

macrumors 6502a
Sep 15, 2019
795
734
Given all this I will buy the extra 4x 32GB R-DIMMs and have a RAM capacity of 384GB, which will actually allow me to solve larger and more complex CFD problems, with the added benefit of 100% (Balanced) memory performance.
This is the best move as you get the full memory bandwidth plus you can never have too much memory :)
 

Onelifenofear

macrumors 6502a
Feb 20, 2019
567
1,121
London
Can I jump in...

Can I get
2x 8GB Hynix (if I can source them in the UK) to add to the 32GB in the base model,
and
6x 32GB?

240GB total.

As I understand it there are 2x 6 channels and you need to install in sixes for the top speed - will the different sizes screw that up?
 

Snow Tiger

macrumors 6502a
Dec 18, 2019
854
629
My main concern is for the CFD code we/I run on the MP7,1 with this mix of R-DIMM sizes. The multi-threaded code is very memory intensive.

To solve some of the most complex CFD problems the code is set up to run across 28 cores (our MP7,1 will be a 16-core model). Each core will require 10 GB of RAM. Thus, when this CFD code runs there will be some 8GB of RAM left for other needed system processes that we have no control over.

Each core 'worker-bee' attempts to grab a portion of its 10GB of RAM and keep it in the processor cache (L1 cache is 32K x16 and L2/L3 cache is 38 MB according to the specs I've seen) to chomp on, and when finished moves on to another chunk of its RAM to continue solving its assigned task.

At this time it's unclear to me if the current 4x 8GB R-DIMMs mixed in with 8x 32GB R-DIMMs will be less efficient for the CFD code, as I've just explained, than populating all DIMM slots with 32GB R-DIMMs.

If in doubt I could spend the extra $557 and purchase 4 more 32GB R-DIMMs and place the 4x 8GB R-DIMMs on the shelf. This would then provide an even distribution of same size R-DIMMs and allow each core to command 13GB RAM(memory) that in itself allows for more complex/bigger CFD problems to be solved.

If your 16-core MP7,1's primary task is your fluids program and your workflow is optimized with 10GB of main system memory per core, then it seems your best configuration would be to obtain 12 matching 16GB modules for a total of 192GB. That should be enough, and with the best possible memory bandwidth. Scrap the 4x 8GB modules already installed. Sell 'em on eBay.
 

bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
If your 16-core MP7,1's primary task is your fluids program and your workflow is optimized with 10GB of main system memory per core, then it seems your best configuration would be to obtain 12 matching 16GB modules for a total of 192GB. That should be enough, and with the best possible memory bandwidth. Scrap the 4x 8GB modules already installed. Sell 'em on eBay.
The 12x 16GB provides 192GB. Running CFD over 28 cores/threads, each requiring 10GB, would need 280GB; this exceeds 192GB and would reduce each core/thread to only around 6.8GB. Plus, this leaves nothing for other system processes that are always swimming around.

At this time I'm sticking with 12x 32GB R-DIMMs and will shelve the stock 4x 8GB R-DIMMs for use if I ever need to send the MP7,1 in for Apple servicing/repairs etc.

The 384GB of RAM allows 28 cores/threads to each obtain 10GB easily and leaves breathing space of 104GB for other things; I suspect kernel_task will use as much as 10GB, and another large chunk of the 104GB will be used for the dynamic kernel file buffer cache.

My CFD code dumps out periodic checkpoint files to protect its long wall-clock runs in the event the hw/sw crashes. These checkpoint files can be very large - as much as 100GB - and they need to be written out as fast as possible, because the CFD application is stalled while a checkpoint file is being written. For example, writing some 100GB to a spinner HDD at say 150 MB/s would take some 700 secs, whereas writing 100GB through the kernel file buffer cache onto a fast Sonnet/Samsung 4-bladed RAID-0 would take some 15 to 30 seconds - quite possibly 15 seconds just into the buffer cache, letting the system complete the transfer to the Sonnet/Samsung PCIe card while the CFD code continues.

The CFD code can run for days at a time, and checkpoint files are configured to be written once per hour. A 7-day run is some 168 hours of wall-clock time, so writing 100GB checkpoint files each hour to a HDD without the file buffer cache would consume some 33 hrs of I/O wait time, vs. roughly 1 hr using the buffer cache and a very fast Sonnet/Samsung PCIe card.
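The stall-time arithmetic works out roughly like this (the 150 MB/s HDD figure is from the post; the 6 GB/s figure for a 4-blade NVMe RAID-0 is an assumed ballpark, and GB is treated as 1000 MB):

```python
def stall_hours(checkpoint_gb, write_mb_per_s, checkpoints):
    """Total hours the solver spends blocked writing checkpoints."""
    seconds_each = checkpoint_gb * 1000 / write_mb_per_s
    return checkpoints * seconds_each / 3600

hourly = 168  # one checkpoint per hour over a 7-day run
print(round(stall_hours(100, 150, hourly), 1))   # ~31 h stalled on a HDD
print(round(stall_hours(100, 6000, hourly), 1))  # ~0.8 h on a fast NVMe RAID
```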
 

Snow Tiger

macrumors 6502a
Dec 18, 2019
854
629
The 12x 16GB provides 192GB. Running CFD over 28 cores/threads, each requiring 10GB, would need 280GB; this exceeds 192GB and would reduce each core/thread to only around 6.8GB. Plus, this leaves nothing for other system processes that are always swimming around.

At this time I'm sticking with 12x 32GB R-DIMMs and will shelve the stock 4x 8GB R-DIMMs for use if I ever need to send the MP7,1 in for Apple servicing/repairs etc.

The 384GB of RAM allows 28 cores/threads to each obtain 10GB easily and leaves breathing space of 104GB for other things; I suspect kernel_task will use as much as 10GB, and another large chunk of the 104GB will be used for the dynamic kernel file buffer cache. My CFD code dumps out periodic checkpoint files to protect its long wall-clock runs in the event the hw/sw crashes. These checkpoint files can be very large - as much as 100GB - and they need to be written out as fast as possible, because the CFD application is stalled while a checkpoint file is being written.

I'm a little confused about something. How is it possible to run a 28-core operation with only 16 physical cores in real time? Things will pile up in a queue, which would seem to negate the value of all that expensive extra main system memory that cannot be utilized in real time. Caches can only help so much... they're not intended to replace major components.
 

bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
I'm a little confused about something. How is it possible to run a 28-core operation with only 16 physical cores in real time? Things will pile up in a queue, which would seem to negate the value of all that expensive main system memory that cannot be utilized in real time. Caches can only help so much... they're not intended to replace major components.
Yes, you're correct.... my bad terminology. I should have stated 28 threads being used, leaving 2 cores free for other system processes. Thanks for the correction. :)
 

Snow Tiger

macrumors 6502a
Dec 18, 2019
854
629
Yes, you're correct.... my bad terminology. I should have stated 28 threads being used, leaving 2 cores free for other system processes. Thanks for the correction. :)

OK! I thought maybe you discovered some real magic here and were leaving me out of the game! Cores are always a physical component (think of one of them as a crippled, shrunken stand-alone CPU). Threads are always virtual: two or more optimized data-processing streams attached to a single core.
 

bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
OK! I thought maybe you discovered some real magic here and were leaving me out of the game! Cores are always a physical component (think of them as a crippled, shrunken stand-alone CPU). Threads are always virtual: two or more optimized data-processing streams attached to a single core.
Keeping each thread's data in the processor L1/2/3 caches as much as possible is a real challenge, and helps minimize the number of RAM fetches that have to be made. Spilling data from the caches is an expensive overhead.
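In miniature, the blocking idea looks like this (a toy sketch in Python rather than the real solver; the point is the access pattern, not the speed of this particular snippet):

```python
def blocked_sum_of_squares(data, block):
    """Walk `data` one cache-sized block at a time.

    In a real solver each block is loaded once and heavily reused
    (many flops per element) before moving on -- that reuse is what
    keeps the working set in L1/L2 instead of spilling back to RAM.
    """
    total = 0.0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]     # one cache-resident tile
        total += sum(x * x for x in chunk)    # stand-in for real work
    return total

data = [float(i) for i in range(10)]
assert blocked_sum_of_squares(data, 4) == sum(x * x for x in data)
```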
 

Snow Tiger

macrumors 6502a
Dec 18, 2019
854
629
Keeping each thread's data in the processor L1/2/3 caches as much as possible is a real challenge, and helps minimize the number of RAM fetches that have to be made. Spilling data from the caches is an expensive overhead.

I would definitely install 12 matching memory modules in your MP7,1, then.

Also, since I cannot think of a way to disable the ECC function of the memory in a Mac system, you might consider installing non-ECC memory modules in your Mac if high performance is more important than parity. It'd be your judgement call. You'll gain maybe around a ten percent performance boost, because there is some overhead with using ECC modules. Personally, I'd believe in safety first. But if your back is against the wall here...
 

bxs

macrumors 65816
Original poster
Oct 20, 2007
1,144
522
Seattle, WA
I would definitely install 12 matching memory modules in your MP7,1, then.

Also, since I cannot think of a way to disable the ECC function of the memory in a Mac system, you might consider installing non-ECC memory modules in your Mac if high performance is more important than parity. It'd be your judgement call. You'll gain maybe around a ten percent performance boost, because there is some overhead with using ECC modules. Personally, I'd believe in safety first. But if your back is against the wall here...
I prefer ECC RAM - it's one reason I use the iMac Pro, MP6,1 and MP7,1 systems. It only takes a few bit flips to cause my CFD code to be unable to converge to a solution.
 

AidenShaw

macrumors P6
Feb 8, 2003
18,667
4,675
The Peninsula
I now have 6x 32Gbyte 3rd-party RDIMMs and my original 4x 8Gbyte RDIMMs in my 16-core MacPro7,1. Below are Geekbench 5 (5.1.0) runs from some memory layouts I set up out of curiosity.
Interesting, but remember that these "bandwidth virus" memory tests basically defeat the cache to find the raw bandwidth. Real applications are written to optimize cache usage, and won't see nearly the performance hit that "bandwidth viruses" do.

As was posted - "If you look at the multi-core tests you'll see several subtests that are clearly very sensitive to memory bandwidth". I'd think that could be rephrased as "most subtests didn't care about memory layout".

If you don't know that your main apps are bandwidth sensitive, best to ignore synthetic tests.
 