Oh dear. This is probably becoming way off-topic and useless to readers of this forum. I apologize again for the tone of my previous post, Nanofrog. Stuff like that is never warranted and I shouldn't have indulged myself. I swear that this long and tedious response will be my last on this topic.
.....
Granted, the articles are written about MPSoC's, the fundamentals still apply (Intel and AMD parts are just more complex).
No. It is precisely because Intel and AMD parts are so different from MPSoC systems that the fundamentals discussed in those first two articles do not apply.
For the narrow, highly specialized design choices in embedded MPSoC systems, which are virtually always dedicated to a single application or a small group of closely related applications, those considerations are valid. But they are not relevant to a Mac Pro user or prospective buyer. Notice that the first article explicitly states:
Such an architectural exploration scheme is quite different from the development of general-purpose computer systems that are designed for good average performance over a set of typical benchmark programs that cover a wide range of applications with different behaviors...in the case of embedded systems, the features of the given application can be used to determine the architectural parameters...Unlike a general-purpose processor, in which a standard cache hierarchy is employed, the memory hierarchy - indeed the overall memory organization - of an MPSoC-based system can be tailored in various ways.
Then there is this:
The memory can be selectively cached; the cache line size can be determined by the application; the designer can opt to discard the cache completely and employ specialized memory configurations such as FIFOs and stream buffers; and so on.
The EE Times articles describe an embedded environment where the designers (i.e. the guy who designs the CPU and memory system and the guy who writes the software sit in adjoining cubicles and argue about cache design and behavior) have pretty much complete control over how hardware and software interact. As the article clearly states, this is very different from a mass-market CPU system that dudes liquid-cool and overclock. For example, the articles you cite mention that in power-efficient embedded systems the memory system must be designed carefully because it might account for 90% of power usage (since the CPUs themselves are a bunch of tiny units that may be, strictly speaking, "general purpose" but tend to have small, array-oriented instruction sets). Observe that here on the Mac Pro Rumors forum, half of the threads involve users asking whether they should buy one or two high-performance GPU cards, each of which dwarfs the 130W thermal footprint of a 6-core, very complex general-purpose CPU. (Actually the GPU cards more closely resemble the systems discussed in your EE Times article than the Mac Pro does.)
The question of optimizing the use of FSBs and memory channels would concern an MPSoC embedded system designer
only because he is designing the memory system as well as the specialized, highly constrained software that uses it. He would not have to expose a programming interface to an operating system or an application in order to optimize them; he would expect the hardware to have been optimized on paper during the design phase. And his compiler would automatically generate code that accounted for that memory architecture without any special flags. These issues don't face Mac application developers.
If compiler does this correctly however, the end result is a high level language compiler that the programmer can view it as abstract (there are still flags that can assist with speed optimizations for the core/s, memory, and disk IO for example).
There are exceptions; but as a general rule, disk I/O is so far from an application in a heavy-duty general purpose operating system like Mac OS (Unix) that a developer could not even begin to think about writing application-level code that would optimize I/O performance with
respect to the architectural specifics of a memory controller.
This is the area where I believe you are conflating two somewhat orthogonal issues. Take, for instance, your example about optimizing cryptography algorithms on Nehalem. One of several reasons I chose the 6-core Mac Pro was precisely because my favorite desktop open-source crypto utility,
TrueCrypt, is optimized (whether by a compiler or by hand-written assembler code I don't know) to use the new AES instructions that are present in Westmere but not in Nehalem. Compilers can certainly generate better array-manipulation code for different levels of the SSE instruction set, and can even optimize for that dynamically at run time.
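For what it's worth, here is roughly what that kind of runtime dispatch looks like. This is my own sketch, not TrueCrypt's actual code, and the function name and messages are made up for illustration: the program asks CPUID whether the AES-NI feature bit (leaf 1, ECX bit 25) is set and picks a code path accordingly. Notice that nothing in it knows or cares what sits between the caches and the DIMMs.

[CODE]
/* Hypothetical sketch of runtime AES-NI dispatch (not TrueCrypt's code). */
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#include <stdio.h>

/* Returns nonzero if the CPU reports the AES-NI feature bit
   (CPUID leaf 1, ECX bit 25). */
static int cpu_has_aesni(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 25) & 1;
}

int main(void)
{
    if (cpu_has_aesni())
        puts("AES-NI present: use the AESENC-based code path");
    else
        puts("No AES-NI: fall back to a table-based software AES");
    return 0;
}
[/CODE]

On a Westmere Mac Pro the first branch is taken; on a Nehalem one, the second. Either way the dispatch depends on an instruction-set feature, not on the memory controller.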
But what do we see in your articles when they talk about optimizing for the Nehalem architecture? Nothing very new, and specifically nothing that mentions FSBs vs. QPI. They say: instrument the code, run the code, find out where the bottlenecks are, optimize the code. Hey, wish I'd thought of that!!! That applies to all software, all the time, and a quick sketch of it follows below. No non-trivial software can ever be optimized to run optimally on a wide array of unspecified general-purpose systems. So yes, all software, including Photoshop and Final Cut Pro and Plants vs. Zombies and World of Warcraft, could be better optimized if the developer were given a much narrower set of assumptions about the underlying hardware.
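To be fair to that advice, here is the whole workflow in miniature, assuming an ordinary Unix-style toolchain; the program and its function name are invented for illustration. Compile with profiling enabled, run, and let the profiler point at the hot loop (on a Mac, Apple's Instruments or the `sample` command plays the same role). Nothing in the procedure is specific to FSBs or QPI.

[CODE]
/* A minimal sketch of "instrument, run, find the hot spot, optimize".
   Build with profiling:  cc -O2 -pg hotspot.c -o hotspot
   Run it:                ./hotspot
   Inspect:               gprof ./hotspot gmon.out
   (On Mac OS X, Instruments or `sample` serves the same purpose.)   */
#include <stdio.h>

/* Deliberately simple hot function; the profiler will point here. */
static double slow_sum(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

int main(void)
{
    static double v[1 << 20];
    for (int i = 0; i < (1 << 20); i++)
        v[i] = i * 0.5;

    double total = 0.0;
    for (int pass = 0; pass < 1000; pass++)   /* enough work to show up */
        total += slow_sum(v, 1 << 20);

    printf("%f\n", total);
    return 0;
}
[/CODE]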
I appreciate your work in digging up all those interesting citations. But please note this key point: in the article you cite where memory system architecture is central, the focus is on dedicated embedded systems where software and hardware can be designed and optimized in tandem. In the articles you cite that discuss the optimization of algorithms on general-purpose systems such as Intel and AMD, no mention is made of memory controller architecture, particularly FSBs and channels. This is not an accident, because the optimization fundamentals in those two kinds of systems are radically different.
In the memory subsystem of a Mac Pro, the hardware puts an address on some pins, and a little while later the main memory puts some data on some other pins. If the memory is under the control of an older memory controller (i.e. an external northbridge chip that adds some latency to the mix), the data comes back a little slower. If the memory is controlled by an integrated multi-channel QPI setup that can achieve more efficient hardware-level concurrency through interleaving, the data comes back a little faster. That's it. The only optimization that can be done is to overclock the memory as a whole, or to exclude sections of main memory from the on-chip caches. And although the compiler has lots of leeway to re-order instructions to reduce the likelihood of stalls, and to align instructions on word boundaries to avoid unnecessary memory cycles, there are
no local optimizations that either a compiler or a human programmer can make that will produce code that runs better when the memory controller is internal with three channels instead of external with two. All your citations are interesting, but none of them, in the final analysis, addresses the point we have been (perhaps needlessly) disputing: many optimizations to software are possible, and software is never fully optimized. But in the Mac Pro, none of the possible local optimizations (even including word alignment) involves deliberate manipulation of the memory controller's behavior.
Best wishes and sorry again for my grumpiness.
TD
Edit: the example given in your Nehalem article, about optimizing the RC4 encryption algorithm by manually (not via the compiler) re-ordering the C code, is instructive, since the assembler output for the resulting changes is provided. The example is embedded in an article with Nehalem in the title, but the optimization itself never mentions Nehalem or memory channels and was in fact done on a Harpertown CPU. It's simply a clever way to delay the use of an intermediate computational result to avoid a stall. It's a good example of how unpredictable the output of compilers can be, and of clever manual loop optimization. It has nothing to do with memory controller architecture.
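Purely to show the flavor of that trick, here is my own sketch of it, not the article's code and considerably simplified: the same RC4 keystream loop written the obvious way, and then hand-reordered so the next iteration's table load is issued before the current keystream byte is read, which delays the use of that loaded value and gives it time to arrive before j needs it. The function names are invented, and the little main() only checks that the two loops produce identical output.

[CODE]
/* Sketch of stall-avoidance by reordering an RC4 keystream loop.
   NOT the code from the article -- just the general idea.           */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Straightforward RC4 PRGA: generate len keystream bytes from state S. */
static void rc4_naive(uint8_t S[256], uint8_t *out, int len)
{
    uint8_t i = 0, j = 0;
    for (int n = 0; n < len; n++) {
        i = (uint8_t)(i + 1);
        uint8_t a = S[i];
        j = (uint8_t)(j + a);
        uint8_t b = S[j];
        S[i] = b;                        /* swap S[i] and S[j] */
        S[j] = a;
        out[n] = S[(uint8_t)(a + b)];    /* load depends on the swap above */
    }
}

/* Same algorithm, with the next iteration's table load issued before the
   current keystream load, so the value that feeds j is requested early
   and its use is pushed later in the instruction stream.               */
static void rc4_reordered(uint8_t S[256], uint8_t *out, int len)
{
    uint8_t i = 1, j = 0;
    uint8_t a = S[1];                    /* pre-load the first S[i] */
    for (int n = 0; n < len; n++) {
        j = (uint8_t)(j + a);
        uint8_t b = S[j];
        S[i] = b;                        /* swap S[i] and S[j] */
        S[j] = a;
        uint8_t k = (uint8_t)(a + b);

        uint8_t i_next = (uint8_t)(i + 1);
        uint8_t a_next = S[i_next];      /* issued early; reads post-swap state */

        out[n] = S[k];                   /* dependent load used last */
        i = i_next;
        a = a_next;
    }
}

int main(void)
{
    /* Identity permutation as a stand-in state (no key schedule); enough
       to check that the two loops generate the same keystream.         */
    uint8_t S1[256], S2[256], k1[64], k2[64];
    for (int n = 0; n < 256; n++)
        S1[n] = S2[n] = (uint8_t)n;
    rc4_naive(S1, k1, sizeof k1);
    rc4_reordered(S2, k2, sizeof k2);
    printf("keystreams %s\n", memcmp(k1, k2, sizeof k1) ? "differ" : "match");
    return 0;
}
[/CODE]

And that's exactly the point: the transformation is about hiding load-to-use latency inside the core, so it helps on a Harpertown with an FSB just as it would on a Westmere with three integrated channels. Nothing in it touches, or could touch, the memory controller.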