I was dealing specifically with Symmetric Multiprocessing (SMP) and memory management, while you seem to be focusing on the physical layers.
We are talking about dual/tri-channel memory which is an element of the physical implementation of RAM.
There's a really simple fact that demonstrates there's a software influence: if bandwidth usage were purely a function of the memory controller in the Nehalem parts, then all applications would be able to utilize triple-channel operation. But that's not the case at all. Some applications can; most can't.
Your observation that tri-channel appears to have little impact is correct... however, you are in error as to the root cause. Rest assured that with 3 identical DIMMs, the memory controller is interleaving across all three channels for all memory operations... all the time. All applications are using triple channel; you just don't see any difference between tri-channel and dual-channel in real-world apps because the CPU cache is large enough to fully contain the working set for most apps, and with a good prefetch engine, as most modern CPUs have, a cache miss is a rarity.
As I continue to say, the concept is identical to striping across multiple disks in a RAID0 array. Clearly things scale as you add more disks... an array of 3 disks has better performance than an array of 2 disks. And of course, data is striped across all 3 disks all the time; it's not a function of software. But the difference between a 2-drive and a 3-drive array is subtle for most apps, and it takes specialized benchmarks to reveal the true difference. Now, if you introduce a RAID controller with a huge cache, the differences between a 2-disk and a 3-disk array get even more difficult to measure for most applications. If the cache is big enough, it will make the RAID array seem to perform amazingly, regardless of whether the array has 2 or 3 drives.
The same effect is going on with CPU cache... it's so large that it makes dual-channel DDR2 look impressive, and pumping the memory bandwidth up with tri-channel DDR3 has little impact because the cache already contains everything the core logic needs to keep operating at peak efficiency. Only specialized benchmarks that bypass cache can measure the true bandwidth difference. The real-world effect of going from dual to tri-channel with such huge CPU caches is almost negligible.
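To make the working-set point concrete, here is a minimal sketch of the kind of test that exposes it. The buffer sizes and the use of memcpy are my own illustrative choices, not any particular benchmark's methodology: the small run stays inside cache and tells you nothing about channel count, while the large run spills out to DRAM, which is the only place dual- vs tri-channel can actually show up.

/* Sketch of the working-set effect described above. The sizes are
 * hypothetical: 32 KiB fits comfortably in cache, 256 MiB does not,
 * so only the second measurement is limited by DRAM bandwidth (and
 * therefore by the number of memory channels). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double copy_bandwidth(size_t bytes, int reps)
{
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) { perror("malloc"); exit(1); }
    memset(src, 1, bytes);                       /* touch the pages once */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, bytes);                 /* stream reads + writes */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb  = (double)bytes * reps * 2 / 1e9; /* read + write traffic */
    free(src);
    free(dst);
    return gb / sec;
}

int main(void)
{
    /* Small working set: served from cache, channel count is irrelevant. */
    printf("32 KiB  : %.1f GB/s\n", copy_bandwidth(32u * 1024, 100000));
    /* Large working set: spills to DRAM, channel count finally matters. */
    printf("256 MiB : %.1f GB/s\n", copy_bandwidth(256u * 1024 * 1024, 10));
    return 0;
}

Run it on a dual-channel and a tri-channel box and the first number will barely move, while the second is where any extra channel bandwidth would appear.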
It has to do with both the compiler (not just architecture optimization, but the libraries/functions available that assist in multi-channel memory optimization) and the application code itself (which must be written to use all available channels if that support doesn't exist in the compiler). Worse yet, a poorly written OS could hamper bandwidth utilization as well (assuming the application code doesn't try to bypass the OS's memory management). Unoptimized code is more likely to use a single channel's bandwidth than the parallelism that would be possible (and OSes aren't capable of forcing it either).
Unfortunately, it's more common to find a multi-threaded application that only uses one memory channel than one that can utilize n cores and n memory channels simultaneously (or a single-threaded application that can use n memory channels).
Not at all true. The number of channels is a physical-layer element that is abstracted away from higher-level functions. Software and the OS are not aware of, and don't care, how many physical channels the memory controller is using to interleave operations; this is handled at a much lower level. Again, it's the same as the number of drives in a RAID0 array: the RAID controller knows how many drives there are, but to the OS and apps, it's just a file system.
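If it helps, here's a rough sketch of what that lower-level mapping looks like. The cache-line size, channel count, and simple round-robin selection are assumptions for illustration, not Nehalem's actual scheme; the point is only that the channel is derived from the physical address inside the controller, with nothing for the OS or application to opt into, exactly like a RAID0 controller deriving the disk from the stripe number.

/* Illustrative channel interleaving: which channel a cache line lands on
 * is a pure function of the physical address, chosen by the memory
 * controller. Line size, channel count, and the round-robin rule are
 * assumptions, not any specific controller's mapping. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE   64u   /* bytes per cache line (assumed) */
#define NUM_CHANNELS 3u    /* tri-channel configuration */

static unsigned channel_for(uint64_t phys_addr)
{
    uint64_t line = phys_addr / CACHE_LINE;    /* which cache line */
    return (unsigned)(line % NUM_CHANNELS);    /* rotate across channels */
}

int main(void)
{
    /* Consecutive cache lines rotate across all three channels. */
    for (uint64_t addr = 0; addr < 8 * CACHE_LINE; addr += CACHE_LINE)
        printf("phys 0x%03llx -> channel %u\n",
               (unsigned long long)addr, channel_for(addr));
    return 0;
}

No compiler flag, library call, or OS policy appears anywhere in that mapping, which is the whole point: there is nothing for software to "use" or fail to use.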
Ultimately, not all software has been written to utilize the bandwidth available across multiple memory channels (simultaneous channel operation, essentially SMP applied to the memory controller itself). Until that changes and software catches up, the full benefit won't be seen by most users, save for a few specialized applications.
Software does not need to be optimized to take advantage of the read/write performance of a 3-drive RAID0 array, nor does it need to be for tri-channel memory. These are hardware-level details that are abstracted away from higher-level functions.
BTW, the benchmarks and apps that make tri-channel DDR3 look good are not optimized in any way. They simply bypass cache (the benchmarks) or have working sets that exceed the cache size (the apps) and therefore more readily reveal the underlying memory bandwidth bottleneck. Most apps are incapable of revealing this, as their working sets execute entirely from within cache.
Also, if you are interested, the history of this problem goes back to NetBurst, where Intel needed a way to overcome the limitations of the FSB. They started dedicating more silicon to cache to improve memory performance, and with the Core 2 architecture settled on a standard of 2MB of cache per core... twice as much as AMD's architecture required, since AMD had an IMC. Interestingly enough, Intel continues to use more cache than necessary, with 2MB of L3 cache per core, even now that they have an IMC. Most computer scientists are puzzled by this decision, because Intel could now reallocate some of that silicon in favour of improving the performance of the core logic without starving the cores for data. However, they've yet to reverse course on their excessive cache sizes.