mysql

I'd like to see some mysql benchmarks. In my experience mysql uses threads pretty well, so I'm hoping it is a big improvement.
 
It shouldn't, but Apple in its infinite wisdom doesn't really give you many choices.
There's always the choice NOT to buy the 8-core Mac Pro. If you want to game, you're better suited with any of the cheaper Mac Pros, which would offer greater relative performance per dollar. Of course, your game selection per dollar will also be far better with a Windows PC.

Bottom line is that you can't have it all, and Apple is just not that terribly interested in tackling the hardcore gamers or the corporate market. They've done quite a bit to help bring more and better titles to Macintosh gamers, but they've never tried to take on DirectX game developers.
 
That's for 8 cores, 2 CPUs with 4 cores each, each CPU having its own bus for its 4 cores, which are on two dies. My point is entirely accurate.

You're right - I read your post too quickly - I deleted my response.

How do you quantify AMD's problems with a NUMA architecture, though?

Because the AMD memory is connected to each socket, CPUs in that socket access the local memory through the memory controller. If the memory happens to be connected to the other socket, the serial HT bus is used to access the remote memory controller - adding latency.

On a 4 socket system, the sockets are connected in a ring. If the memory happens to be on the "far" socket, the memory access has to go across two HT links in series, adding even more latency.

AMD's architecture has nice numbers when you add them all up, but when you actually benchmark the 4 socket machines you see a pretty significant variability in runtimes due to the extra latencies.
 
(note however that coherency traffic will be *very* significant on an 8-core system)

Intel make up for it with good prefetchers in their cores, but these are less effective with more cores, and a high amount of cache. Barcelona has better prefetchers however, and larger cache, so it's probably a wash in the end. So AMD's greater memory bandwidth per core is what seals the deal.

I think the coherency traffic is what is killing the 8-core. Since 10.4 doesn't prevent core swapping, you've got a tremendous waste of scarce memory bandwidth on that problem. My theory is that running only 7 threads will eliminate most of that core swapping, or raising priority on your work threads will also cut the number of core swaps.

There is also the possibility that many programmers tested their code on machines that don't have as big a penalty for False Sharing. (False Sharing causing repeated cache coherency traffic between two threads accessing different, but nearby, data.) Some of the slowdown on 8-core could be eliminated by optimizations in the programs.
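To make the False Sharing point concrete, here is a rough sketch of the layout arithmetic only (it does not measure anything). The 64-byte line size and the 8-byte per-thread counters are my assumptions, not something stated in the thread, and the function names are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed line size, not taken from the thread


def cache_line(offset, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that holds the byte at `offset`."""
    return offset // line_bytes


def shares_line(offset_a, offset_b, line_bytes=CACHE_LINE_BYTES):
    """True when two byte offsets land on the same cache line."""
    return cache_line(offset_a, line_bytes) == cache_line(offset_b, line_bytes)


def padded_offsets(count, item_bytes=8, line_bytes=CACHE_LINE_BYTES):
    """Offsets for `count` per-thread items, each rounded up to whole lines."""
    lines_per_item = -(-item_bytes // line_bytes)  # ceiling division
    stride = lines_per_item * line_bytes
    return [i * stride for i in range(count)]


# Eight hypothetical 8-byte counters, one per thread.
packed = [i * 8 for i in range(8)]   # back to back: all in one 64-byte line
padded = padded_offsets(8)           # one line each

print(shares_line(packed[0], packed[1]))  # neighbours collide when packed
print(shares_line(padded[0], padded[1]))  # and do not once padded
```

With the packed layout every write by one thread invalidates the line for the other seven, which is the coherency traffic being described. Padding trades a little memory for keeping each thread on its own line.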
 
your game selection per dollar will also be far better with a Windows PC.

I assume that you're just ignoring Boot Camp as an option. I run Boot Camp marvellously on a second hard drive, with a separate GPU. I could run it with the standard 7300, but I found it fairly easy to install an 8800 instead.
 
How do you quantify AMD's problems with a NUMA architecture, though?

It is an extension of affinity. Once the OS knows about affinity (including memory affinity), the problem is not so bad. The OS knows a thread is happiest on a certain core, and it knows memory allocations from that core should come out of the memory range directly connected to that core. Problem solved.

In the cases where the thread then runs on a different core... tough beans.

XP has this in a glitchy way, Vista has it fully implemented.
 
All this power in the 8-core Mac Pro and not that much software to take advantage of it...:rolleyes:

It's always a chicken/egg problem. You can't really expect software companies to release versions optimized for eight cores before the hardware is released.

At least Apple is going to have some apps that take advantage when FCS2 ships next month. Hopefully Adobe will update soon, if not have 8-core support in time for release (it's not final yet, is it?).
 
You're right - I read your post too quickly - I deleted my response.

No problems, I do the same too often :)

How do you quantify AMD's problems with a NUMA architecture, though?

Because the AMD memory is connected to each socket, CPUs in that socket access the local memory through the memory controller. If the memory happens to be connected to the other socket, the serial HT bus is used to access the remote memory controller - adding latency.

On a 4 socket system, the sockets are connected in a ring. If the memory happens to be on the "far" socket, the memory access has to go across two HT links in series, adding even more latency.

AMD's architecture has nice numbers when you add them all up, but when you actually benchmark the 4 socket machines you see a pretty significant variability in runtimes due to the extra latencies.

This is very true, although the OS can help significantly when it knows it is running on a NUMA architecture machine. Even without that help, you have an average hop count of about 1, which is the same as the CPU <-> chipset hop on the Intel side of the equation.
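The "about 1" figure can be checked with a quick sketch. It assumes the 4 sockets sit in a ring as described above, that memory is spread evenly across sockets, and that accesses are uniformly random with no OS placement help. The function names are just for illustration.

```python
def ring_hops(src, dst, sockets=4):
    """HT links crossed going from socket `src` to socket `dst` on a ring."""
    d = abs(src - dst) % sockets
    return min(d, sockets - d)  # take the shorter way around the ring


def average_hops(sockets=4):
    """Mean hop count from one socket to memory placed uniformly at random."""
    total = sum(ring_hops(0, dst, sockets) for dst in range(sockets))
    return total / sockets


print([ring_hops(0, dst) for dst in range(4)])  # local, neighbour, far, neighbour
print(average_hops(4))
print(average_hops(2))
```

From any socket the distances are 0, 1, 2 and 1 hops, so the mean is 1. A NUMA-aware OS pushes that toward 0 by keeping a thread's memory on its own socket.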
 
To sum things up

The 8-core is not as fast as expected/hoped because of three things combined:

1. Tiger is suboptimal when it comes to parallel execution.
2. The memory bus is a slowing factor.
3. The software isn't parallelized enough to benefit from 8 cores.

Agreed?
 
The 8-core is not as fast as expected/hoped because of three things combined:

1. Tiger is suboptimal when it comes to parallel execution.
2. The memory bus is a slowing factor.
3. The software isn't parallelized enough to benefit from 8 cores.

Agreed?

#3 is only true in some cases, like games. Even when the software is parallelized for 8-core, it is still running slow because of #1 & #2. It depends on the data access & computation patterns.
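For a feel of how much #3 matters on its own, here is a small sketch using Amdahl's law. That is a standard rule of thumb I am bringing in, not something from the thread, and the parallel fractions below are made-up inputs, not measurements of any real app. It ignores #1 and #2 entirely.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only `parallel_fraction` of the work scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


# Hypothetical parallel fractions, chosen only for illustration.
for p in (0.5, 0.9, 0.99):
    four = amdahl_speedup(p, 4)
    eight = amdahl_speedup(p, 8)
    print(f"p={p:.2f}: 4 cores {four:.2f}x, 8 cores {eight:.2f}x")
```

Even in this idealized model, going from 4 to 8 cores only pays off when nearly all of the work is parallel. Memory bus and scheduler effects would come on top of that.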
 
I assume that you're just ignoring Boot Camp as an option. I run Boot Camp marvellously on a second hard drive, with a separate GPU. I could run it with the standard 7300, but I found it fairly easy to install an 8800 instead.
By "Windows PC" I mean a Core 2 Duo/Core 2 Extreme desktop system--I should have specified. You don't see many gamers building Xeon rigs, for a very good reason. You can get superior gaming performance for a cheaper price by skipping the FB-DIMMs and the workstation motherboards.

Certainly, boot camp would be an option if you wanted to game with a Mac Pro, but you still wouldn't want to spend the premium for an 8-core version because you'd get almost no return on that extra investment (for gaming). The much cheaper quad core would offer almost identical performance, and leave room in the budget for a killer graphics card.
 
Barefeats speculates that the 8-Core Mac Pro may be bottlenecked by the memory bus and also considers the possibility that Mac OS X Tiger may not be well optimized for the 8-Core Mac Pros.

Get ready for the "I'm waiting for the REAL 8-core MacPro" posts...
 
It's always a chicken/egg problem. You can't really expect software companies to release versions optimized for eight cores before the hardware is released.

Huh? There's nothing to say the hardware has to be in existence before the software is optimized for it. There's plenty of mathematics/genetics software made to scale up to 64 or 128 processors.

Also, remember that the original Clovertown upgrade was performed months ago. So it's not like it would have been impossible to build an 8-core Mac Pro earlier to have something to test on.
 
Get ready for the "I'm waiting for the REAL 8-core MacPro" posts...

In a way, they have started.

1. Some people want a single chip with 8 cores (nobody has posted that yet here)
2. Some people want a system that has a total of 8 cores (two chips with 4 cores) that doesn't have such bottlenecks.

This 8 core system reminds me of the early Macintosh G4 Duals. Yes, you had dual processors, but the system board wasn't suited (bandwidth of everything) to get data in and out really fast. That's why those early G4s were nowhere near twice as fast as a single processor system at the same clock.


Reminds me of the pre Merom days here in Macland. I wonder how many people who are complaining about this 8 core system and want something better, are even able to buy / afford the product in the first place.

I kept a running tally of how many Merom "chronic" complainers there were, and how many actually got systems at or near launch time. Let's just say that the percentage of those who actually got one (or pretended to) was well below 60%!
 
That's for 8 cores, 2 CPUs with 4 cores each, each CPU having its own bus for its 4 cores, which are on two dies. My point is entirely accurate.
I don't think there is actually a major problem with 1333MHz FSBs for quad cores. Besides, it's too simplistic to do some division and say there is only 333MHz of equivalent bandwidth per core. For one thing, the unified L2 cache means it's more like 667MHz per L2 cache. In terms of cache coherency, the snoop filter does help with that. And you can't just say that Netburst started with a 400MHz FSB. Merom is a fundamentally different architecture and seems much more FSB bandwidth agnostic.

You can check benchmarks between the 1.8GHz E4300 with an 800MHz FSB and the 1.86GHz E6300 with a 1067MHz FSB, and the results are nearly identical, with variations more likely due to the slight clock speed advantage of the E6300 than the FSB difference. What's more, these parts only have 2MB of L2 cache, so larger caches would buffer the difference even more. A comparison between mobile Core 2 Duos and desktop Core 2 Duos actually shows them performing similarly at similar clock speeds and cache sizes despite Merom only having a 667MHz FSB and Conroe having a 1067MHz one.

Yes, in the most bandwidth intensive applications a 1333MHz FSB will likely not be enough, but the bottleneck in the Bensley platform is not the FSB but the memory controller. FB-DIMMs in general are very inefficient and the first-gen MCs are even worse. Tests have shown that a quad channel DDR2 533 FB-DIMM setup (4-4-4 timings) only has as much bandwidth as a dual channel DDR2 667 setup (5-5-5 timing) and has much higher latency. There is nothing wrong with the 1333MHz FSBs, because currently they just don't have enough memory bandwidth to fill them so increasing the FSB won't solve anything.

http://www.theinquirer.net/default.aspx?article=38961
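To put the FB-DIMM comparison in numbers, here is a sketch of the theoretical peaks only. It assumes an 8-byte (64-bit) data path per channel and treats the DDR2 speed grade as transfers per second. The measured figures are in the linked article, not here.

```python
BYTES_PER_TRANSFER = 8  # assumed 64-bit data path per memory channel


def peak_memory_gb_per_s(mega_transfers, channels):
    """Theoretical peak in GB/s for `channels` channels at `mega_transfers` MT/s."""
    return mega_transfers * BYTES_PER_TRANSFER * channels / 1000


quad_fbdimm_533 = peak_memory_gb_per_s(533, channels=4)
dual_ddr2_667 = peak_memory_gb_per_s(667, channels=2)

print(f"quad channel DDR2 533: {quad_fbdimm_533:.1f} GB/s peak")
print(f"dual channel DDR2 667: {dual_ddr2_667:.1f} GB/s peak")
print(f"peak ratio: {quad_fbdimm_533 / dual_ddr2_667:.2f}x")
```

On paper the quad channel setup should have roughly 1.6 times the bandwidth. If the two really measure about the same, that gap is the inefficiency being talked about.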

Still, Intel is moving to dual 1600MHz FSBs for Xeon Penryn variants, but the major improvement in terms of memory bandwidth is not the faster FSBs, but the redesigned 2nd-gen MC in Stoakley. This should bring better bandwidth utilization and hopefully better scaling from dual to quad channels, and it looks like Intel is specifically focusing on reducing latency, which is a good thing. The Merom architecture seems to be better than Netburst in that it isn't as sensitive to raw bandwidth, but in exchange it is closer to K8 in that it is more sensitive to latency, so Stoakley should help in all regards. Stoakley also has an optimized snoop filter for quad cores, which helps reduce bandwidth requirements for cache coherency.
 
It seems like this is another rushed release from Apple. It had been a while since the Mac Pro saw an update, the 8 core was sort of out and they released it to be ahead of the competition.

Something tells me we'll see a lot of these types of releases by Apple in the near future. The main focus right now is on the electronic gadgets side of things. Unfortunately, this is where there's a lot of money to be made: mass consumer electronics like Apple TV and iPhone. The Leopard delay is one example. They need to create some sense of expectation and keep the favorable rumors going.

We'll see product updates in minor ways, and the excuse software-wise will be the pending release of Leopard. The coming months will be critical for Apple, and by the looks of things they are focusing where there's money to be made, and that's not computers.

And where do you come by this profound wisdom, and ability to see the future? If Apple does not release it, people like you will be screaming rape for them not being responsive to the 'latest and greatest'. When they do release it, they get hammered because it does not live up to someone's unfounded fantasy of what it should be.

The Clovertowns were widely discussed on this site, long before they were ready for production. It was generally agreed the current architecture would not be sufficient to take advantage of the doubling of the processors. Or, more precisely, the advantages would be negligible. So, no one should be surprised when that is what happens.

As for where Apple is focusing their effort, unless you sit in their strategic planning sessions, your credibility is no better than your attitude.
 
As for where Apple is focusing their effort, unless you sit in their strategic planning sessions, your credibility is no better than your attitude.

We don't need to sit in their strategic planning sessions. We already know

  1. They removed "Computer" from their corporate name
  2. The keynote speech at Macworld did not even talk about Macs
  3. They took key people off the Leopard project so they could work on a phone

Taken together we can see what's going on.
 
AMD use Coherent Hypertransport (marketing name: Direct Connect) between the CPUs in their systems, and an on-CPU memory controller (dual-channel DDR2). Direct Connect is 8GBps per link currently, going up to 20GBps later this year. Registered DDR2 is available at 667MHz, maybe 800MHz now, so that's 10-12GB/s per CPU (theoretically).

That means in a 4 CPU (8 core) system you have 8 DDR2 memory controllers, which provides a boat-load of bandwidth. With Barcelona in a couple of months you'll have a 2 CPU (8 core) system with 4 DDR2 memory controllers, so not as good, but more compact (and better than accessing memory over the FSB).

Assuming the latter, and ignoring coherency traffic for an 8 core system:

AMD Barcelona: 21 - 25 GBps memory = 2.6 - 3.2 GBps per core.
Intel: 2x1333MHz FSB = 20 GBps memory = 2.6 GBps per core.
(note however that coherency traffic will be *very* significant on an 8-core system)
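The per-core figures above can be reproduced with a rough sketch. It assumes an 8-byte (64-bit) data path for both a DDR2 channel and the FSB, two dual-channel controllers for the 2-socket Barcelona case, and it ignores coherency traffic just as the numbers above do.

```python
BYTES_PER_TRANSFER = 8  # assumed 64-bit path for a DDR2 channel and for the FSB


def peak_gb_per_s(mega_transfers, links=1):
    """Theoretical peak in GB/s across `links` channels or buses."""
    return mega_transfers * BYTES_PER_TRANSFER * links / 1000


def per_core(total_gb_per_s, cores=8):
    """Even split of the total across all cores."""
    return total_gb_per_s / cores


amd_low = peak_gb_per_s(667, links=4)    # 2 sockets x 2 channels, DDR2 667
amd_high = peak_gb_per_s(800, links=4)   # same layout with DDR2 800
intel = peak_gb_per_s(1333, links=2)     # 2 x 1333MHz FSB

print(f"AMD:   {amd_low:.1f} to {amd_high:.1f} GB/s total, "
      f"{per_core(amd_low):.2f} to {per_core(amd_high):.2f} GB/s per core")
print(f"Intel: {intel:.1f} GB/s total, {per_core(intel):.2f} GB/s per core")
```

These come out slightly above the rounded figures quoted above, but the picture is the same: about even at DDR2 667 and ahead for AMD at DDR2 800.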

Intel make up for it with good prefetchers in their cores, but these are less effective with more cores, and a high amount of cache. Barcelona has better prefetchers however, and larger cache, so it's probably a wash in the end. So AMD's greater memory bandwidth per core is what seals the deal.

Thanks for the info. With computer components changing so quickly, it's hard to keep up. Only question is, since each CPU has its own memory controller, how do they keep from trying to access the same address in memory? Does each CPU get its own bank of RAM? If that's the case, then sharing info between the CPUs would be hard, but not impossible, I guess, if you had a good OS.
 