Let me see:
8 REAL cores at 3.00GHz + up to 32GB (of "slower") RAM
4 REAL cores + 4 virtual cores at 2.93GHz + up to 8GB (of "faster") RAM
8 REAL cores + 8 virtual cores at 2.26GHz + up to 32GB (of "faster") RAM
...
8 cores is already a lot of cores; very few apps can benefit from that. Snow Leopard may help a little, but don't get your hopes up too high...
For most apps, the GHz is still the best way to get performance.
The new 2.93GHz quad looks good, if it were not for the RAM limitation and the price (after all, it is just a 2.93GHz Core i7 CPU with ECC enabled; you can find similar desktops for half the price).
Yeah, the new dual-quad 2.26 gives you 16 "cores", but what? 2.26GHz... Is this a joke? $3299? Is this a joke? April 1st? Most benchmarks show that the 2.26GHz model is far from being a rocket. I'll take the old 3.0GHz over it any day.

Then you speak about LONG TERM... Next year Intel will release Gulftown: a 6-core version of Nehalem on a 32nm process. So next year, you'll have single-CPU models with almost the same number of threads (if that's what you're looking for) as your dual-quad 2.26GHz, but at speeds around 3.00GHz and at the price of the current quad 2.93GHz...

We all know that every year (or so) there will be a new Mac Pro; some updates bring huge improvements, some bring huge price increases. The current one is from the latter category, unfortunately.

geekbench results:

Old Mac Pro octo 2.8GHz (10GB RAM): 11688 http://browse.geekbench.ca/geekbench2/view/98156
New Mac Pro octo 2.26GHz (6GB RAM): 13579 http://browse.geekbench.ca/geekbench2/view/116480

Looks like a good improvement and quite a rocket to me
 
Hm, judging by Tesselator's chart, the 2009 quad 2.93 should have a single-core score of about 4000 and a multicore score of at least 16000, which would probably make it a better machine than the 2.26 octo at the same price (as long as there are no real-world tests to prove otherwise).
If it proves to accept 4GB RAM modules when I run into the 8GB limit, I might go for it.

Oops, sorry, just noticed lordderringe already made a similar post.
 
Hm, judging by this chart, the 2009 quad 2.93 should have a single-core score of about 4000 and a multicore score of at least 16000, which would probably make it a better machine than the 2.26 octo at the same price (as long as there are no real-world tests to prove otherwise).
If it proves to accept 4GB RAM modules when I run into the 8GB limit, I might go for it.

When I look at this graph, all I can say is that there is soooooo much unused power on multi-proc units. Let's hope Grand Central closes the gap. It's obvious that the new OS should use all the cores in most operations.
 
I thought that maybe doing a list of geekbench benchmarks could be helpful too:

  • MacPro (early 2009) Intel Xeon X5570 @ 2.92 GHz (16 cores): 15756
  • MacPro (early 2009) Intel Xeon X5550 @ 2.66 GHz (16 cores): 14657
  • MacPro (early 2009) Intel Xeon E5520 @ 2.26 GHz (16 cores): 13588
  • MacPro (Early 2008) Intel Xeon X5482 @ 3.20 GHz (8 cores): 12155
  • MacPro (Early 2008) Intel Xeon E5462 @ 2.80 GHz (8 cores): 11688

No Quad nehalem benchmarks yet...
 
Regarding the cinebench numbers - isn't it testing OpenGL performance too, so the poor 2.26ghz score could be confounded by it using the crappy NVidia 120 vs. the ATI4870 and the older but more powerful NVidia 8800GT?
 
Microarchitecture, but even that isn't usually enough in this case.

Let's see how hyperthreading works.

Example: one core has one SIMD unit (for SSE operations), and only one thread can use it at a time. So if two threads run on one core (HTT), the second thread has to wait until the first thread is done with the SIMD unit; only then can the second thread use it.

The same happens with any other execution units in a CPU core.

This results in performance hits (which you can see in the game performance charts I posted earlier).

1 HTT core (2 threads) can't beat 2 real cores.
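
A rough back-of-the-envelope sketch of that point (my own toy model in Python, not from this thread; the workload fraction is made up purely for illustration):

    # Toy model: each thread spends some fraction of its time needing the SIMD unit.
    # With 2 real cores there are 2 SIMD units, so nothing serializes.
    # With 1 HTT core, both threads queue for the single SIMD unit.
    def relative_throughput(threads, simd_units, simd_fraction=0.6):
        # simd_fraction is a made-up workload parameter, just for illustration
        waiting = simd_fraction * max(0, threads - simd_units) / threads
        return threads * (1.0 - waiting)

    print(relative_throughput(2, 2))  # 2 real cores     -> 2.0x a single core
    print(relative_throughput(2, 1))  # 1 HTT core (SMT) -> ~1.4x, nowhere near 2x

The exact numbers depend entirely on the workload; the point is just that contention for shared execution units caps the gain.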

And why is it BAD?

Imagine you have a new Mac Pro with one Nehalem Xeon CPU: 8 virtual threads.

And you run Windows in VMware, with 2 CPUs attached to the VM.

When the 2 threads of the virtual machine run on 2 different physical cores, all is OK.

But when the 2 threads of the virtual machine run on 1 physical core, you get a 60-70% performance hit in the VM (under full load).

And you can't control which virtual threads it will use.

Not nice imho.
 
Question

I have a question for the good people on this forum.

My work load currently involves working back and forth from one application to another in the Creative Suite (illustrator / photoshop / indesign / acrobat)
I handle large files in both Illustrator and Photoshop and also run a good amount of native applications at the same time (mail / itunes / iphoto / ichat)

I will be upgrading to CS4 from CS2 when I get my new Pro.

My question is what pro is best for me, and does CS4 make use of 8 cores?

My thinking is that the 4-core would probably do the job now, but with running so many apps at the same time, and the possibility of much bigger files in Photoshop and Illustrator in the future, am I buying into a system that will quickly be out of date? (I want to get at least 3-4 years out of it.)

My options are - 8 core 2.26 or 4 core 2.66
 
Divide the single thread benchmark numbers by the clock speed in GHz. If we increase GHz and keep the RAM subsystem the same, we would expect increasing benchmark numbers, but not quite increasing as fast as GHz. So if we divide by GHz, we would expect that the quotient (benchmark / GHz) should get smaller for the faster clock speeds. That is not the case here. The 2.26 GHz has only 1021 points per GHz, the 2.66 has 1343, and the 2.93 has 1390. That's about 36 percent more performance per GHz for the fastest clock speed, when performance per GHz should actually go down. So something is fishy here. It could be a mistake by the benchmarker, or some performance bug in the application, but something is wrong there.

Eh?
Before, they were wrong because the scaling factor for parallel work is better for the 2.26. Now they are wrong because you don't want to look at parallel work at all; just the scalar / single-core numbers are off. OK, let's follow that logic too (again looking for what the root-cause problems are).

The benchmark / GHz number would only get smaller for faster GHz if you bottlenecked the memory system. Otherwise, an increase in GHz just means faster trips through the pipeline, so an increase in GHz would lead to an increased benchmark score. If you manage to further uncork the memory system, it would actually improve.

How is a single core going to bottleneck this new architecture??? It was purposely designed to minimize the impact of 8 cores stepping on each other's toes. So if we tell the other 7 to catch a "nap" and throw tons of work at just one core, somehow we're going to swamp the memory system? Really????? Why would that happen? There are 3 memory channels; why should a single core not use all three if appropriate and available?

Or... would that single core leverage the features of the memory system to get an even larger share of the bandwidth than if it had to share with the other 7? In that context, that would further uncork any memory-delay blockage in the more unbalanced system, since it can consume more than its "fair share" of memory bandwidth to offset the imbalance.

Your perspective is backwards and seems to want to ignore the negative effects of multiple cores sharing common memory channels, and the ability to leverage multiple channels if available.

In short, a 2.93 core is "too fast" for this memory. Give it more memory bandwidth and it gets better performance. Give it less memory bandwidth and it gets worse performance. [ All the cores are "too fast"... the 2.93 is just king of the "too fast" club with this new generation. ;-) ] There is a ton of L1/L2/L3, branch prediction, and prefetch logic to help it keep plowing forward, but the memory and CPU speeds are unbalanced. Therefore that imbalance is often going to be at the core of performance problems.

The 2.93 is a better "scalar problem" box. The 2.26 is a better "parallel problem" box.






The other factor explicitly mentioned here is Turbo Boost, which is: when the other 7 are napping, some of that power can be used to boost the single core that is left. I'm going to guess that this is a percentage boost of the nominal clock rate, so that it just washes out across all of the same tech.
 
My work load currently involves working back and forth from one application to another in the Creative Suite (illustrator / photoshop / indesign / acrobat)
I handle large files in both Illustrator and Photoshop and also run a good amount of native applications at the same time (mail / itunes / iphoto / ichat)

My options are - 8 core 2.26 or 4 core 2.66
I would get the 2.26. If you're handling large PS files you need a lot of ram and the 2.66 is limited to 8GB.
 
Looking at your numbers, if we divide the single thread scores by the processor speed for the Nehalems we get:

4074/2.93 = 1390 per ghz
3572/2.66 = 1343 per ghz
2039/2.26 = 1022 per ghz

Why is the 2.93 36% faster than would be predicted by the clock difference compared to the 2.26 alone? The 2.26 nehalem is even 10% slower than the harpertowns (~1150 per ghz) when compensating for clock frequency. What is going wrong with the 2.26, and shouldn't turbo boost be doing something here!?

EDIT: just saw gnasher made the same point! :)


Good questions. :D
 
Thanks for the thoughtful response.


Is this pretty much guaranteed? That we can put 4GB or 8GB DIMMs in these slots in the near future? That would certainly make this RAM restriction a non-issue. I don't foresee needing more than 8GB for the next two years or so.

Not guaranteed. Just somewhat likely. Perhaps a bit more likely if Dell, HP, and the other top x86 workstation vendors also come up with workstations that use the exact same kind of memory that Apple is using here (ECC/unbuffered). If Apple and the Mac Pros are the only major consumers of 4GB ECC/unbuffered/DDR3-1066 DIMMs, then there would be a price problem.

However, there could be something in the firmware/motherboard limiting it. It just seems more likely that they didn't do extensive testing of that configuration, so they just don't tell folks about it.


Right now on the Apple store the 8x 2GB memory upgrade option is $500. The 8x 4GB memory upgrade option is $6100. $6100 is almost 2x the cost of a base 8-core box and 2.4x the base 4-core box.... for just memory.




RAID is something I have never thought about. Would it be a good idea to get the RAID card and have two 1TB mirrored HDs? For safety and speed's sake ... ?

The Apple RAID card is expensive. If all you want to do is stripe 2-3 drives, I'm not sure you're really going to get a big bang for the buck there.

If all you're going to do is RAID 0 or 1, there is a decent argument to be made that you can just do that in software. Mac OS X comes with free RAID software to do that. So if you just populate two more drive bays inside the Mac Pro, you can do RAID 0 (for speed) or RAID 1 (for safety) with just software, since you have lots of "extra" processor cycles. There is also the SoftRAID program if you don't like the free Apple utility.


Mirrored 1TB drives are just going to give you safety, not much speed.
Generally, smaller drives striped (RAID 0) will give you more speed.


RAID 1 is better than Time Machine in that everything you have done up to the last subsecond is recoverable on a separate drive. It is not a long-term backup solution, though, like Time Machine (plus storing drives offsite and rotating them).

If you spend lots of beach-ball time opening / printing to a file / saving large files, then RAID 0 would make you happier.

There is a way of doing both, 0+1 or RAID 10: stripe and mirror. The Disk Utility program does that too, but this is an area where hardware RAID offload probably becomes more of a good idea. [ You may just need a card with no fancy RAID features to give you access to some external drives if you want to work with 4-5 drives; those are much cheaper. ] If you stop your external 4-drive RAID 10 setup, pull one half of each mirror, and rebuild with fresh drives... you just took a snapshot backup.


If ZFS comes with Snow Leopard, you get RAID abilities with it also. ZFS is another example of "software RAID", since this file system does it with CPU power. In the future we will see folks with 8-core boxes throwing cores at ZFS's horsepower requirements.


Hardware RAID is better when thinking of RAID 5 and RAID 6. Of course the ZFS folks say that RAID 5 and 6 are "bad", but that is probably best left for another thread. With drives getting cheaper and larger at the same time, I think RAID 10 is pretty useful if we're talking a 4-drive RAID 10 setup versus a 3-drive RAID 5 setup. Both probably have "good enough" disk space to reasonably work with (and keep actively, incrementally backed up). [ e.g.: 3 500GB drives in RAID 5 is 1TB of usable space; 4 500GB drives in RAID 10 is 1TB of usable space. When 500GB drives are $120 a piece, the difference between the two isn't that much, and much less than a fancy RAID card with battery backup. ]

Mirroring very large drives means that your rebuild/recovery time takes longer. It is somewhat safer not to use the largest drives available in a mirroring context.
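
For the capacity comparison above, here is a quick Python sketch of the usable-space arithmetic (my own helper for illustration, using the 3x500GB RAID 5 vs 4x500GB RAID 10 example from the previous post):

    # Usable capacity (in TB) for a few RAID levels, assuming equal-size drives.
    def usable_tb(level, drives, size_gb):
        if level == "raid0":             # striping: all capacity, no redundancy
            return drives * size_gb / 1000.0
        if level == "raid1":             # mirroring: one drive's worth of space
            return size_gb / 1000.0
        if level == "raid5":             # one drive's worth lost to parity
            return (drives - 1) * size_gb / 1000.0
        if level == "raid10":            # half the drives mirror the other half
            return (drives // 2) * size_gb / 1000.0
        raise ValueError("unknown RAID level: " + level)

    print(usable_tb("raid5", 3, 500))    # 1.0 TB
    print(usable_tb("raid10", 4, 500))   # 1.0 TB

Same usable space either way; the RAID 10 set just spends one more drive on redundancy instead of parity.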
 
Right now, one of the best reasons to get the 8-core is After Effects. If you spend any time in AE, you will get the greatest benefit: it can farm out frames to each core to render. (And suck up a lot of RAM; 1.5GB per core is recommended.)
 
Looking at your numbers, if we divide the single thread scores by the processor speed for the Nehalems we get:

4074/2.93 = 1390 per ghz
3572/2.66 = 1343 per ghz
2039/2.26 = 1022 per ghz

Why is the 2.93 36% faster than would be predicted by the clock difference compared to the 2.26 alone? The 2.26 nehalem is even 10% slower than the harpertowns (~1150 per ghz) when compensating for clock frequency. What is going wrong with the 2.26, and shouldn't turbo boost be doing something here!?

EDIT: just saw gnasher made the same point! :)

The Nehalem boxes are NUMA boxes. (There is a possibility of getting non-uniform memory accesses: i.e., something stored in memory attached to the "other" chip package from where this thread is running, perhaps a kernel data structure like a kernel file cache buffer or something like that.)

Depending on how the memory that this thread and the associated kernel threads consume is laid out, you will get different effects. Mac OS X may or may not be laying out memory in the best way for a NUMA box. That effect is negated from the results when both processors labor under the same effects; it would be the same noise in both measurements.


The Harpertown boxes are not NUMA. In fact, the Harpertown design is optimal for the single-core case: one bus (and in this specific case one user/core on the bus, so practically no contention problems) and interleaved/paired DIMMs. If you're primarily just fetching a stream of data and getting bigger chunks out to the Harpertown core, it doesn't get much more favorable than that.

[ Edit update: saw that the 3572 number is for the Quad 2009 2.66. That also is not NUMA, since only a single CPU package is present. Also, the more-than-linear speed-up is slightly curious; it is not much, though. I would expect the Quad 2009 2.93 to have a number closer to 4000, or slightly below. ]

So it is different enough that we are somewhat comparing apples to oranges.

Apple's note on Turbo Boost seems to suggest it is limited to about 13% (their example was 3.33 out of 2.93 if just one core is active). Not sure if that is relative to the clock speed or to the power that was already going to the chip. [ Not that those two are entirely decoupled from each other. ] The 2.93 is sucking down more power, so it may have more to divert while keeping the same constant total power consumption for the whole package. The 2.26 is, if I recall correctly, lower in power, and I'm not sure if the relation to GHz may differ. If the boost is always relative to clock speed, then the 2.26 is only going to jump to about 2.55, so it would still be behind the 2.66 even if the memory differences were removed. (The differences obviously exist because the 2.26 and 2.93 are on a different performance line.)
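
A quick sanity check of that Turbo Boost arithmetic in Python (assuming, as a guess, that the ratio from Apple's 2.93 -> 3.33 example applies proportionally to the other clocks; that is an assumption on my part, not a published spec):

    # Apple's example: the 2.93 Turbo Boosts to 3.33 GHz with one active core.
    boost_ratio = 3.33 / 2.93             # about 1.14
    print(round(boost_ratio - 1, 3))      # ~0.137, i.e. roughly the 13% mentioned above

    # If the same ratio applied to the 2.26 (an assumption, not a published figure):
    print(round(2.26 * boost_ratio, 2))   # ~2.57 GHz, still short of the 2.66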
 
Regarding the cinebench numbers - isn't it testing OpenGL performance too, so the poor 2.26ghz score could be confounded by it using the crappy NVidia 120 vs. the ATI4870 and the older but more powerful NVidia 8800GT?

The OpenGL test numbers are not included in those Cinebench results.
 
Also take a look at the tests where the 2.26 performs better. The 2.8 is better on single-threaded tasks, i.e. most stuff.

Errrrr.
Most stuff right now, if you only do one thing at a time. Not sure why folks are going to buy 8-core boxes to just run single-threaded stuff on just a single core. Seems like overkill to me. The whole point of having 8 cores is to have competing instruction streams. Otherwise, you have bought a whole lot of transistors that spend most of their time in 'sleep/energy saving' mode.

There is a difference between running just one program/thread at a time and having multiple, possibly limited, single-threaded apps running at the same time.

Every metric in the memory section is scalar and the 2.26 wins most of those too. The only loser is the stdlib Write (surprise, the "older" OS possibly doesn't leverage the "new" approach to CPU architecture/NUMA).


Every metric in the stream section is scalar and the 2.26 wins.
 
Errrrr.
Most stuff right now, if you only do one thing at a time. Not sure why folks are going to buy 8-core boxes to just run single-threaded stuff on just a single core. Seems like overkill to me. The whole point of having 8 cores is to have competing instruction streams. Otherwise, you have bought a whole lot of transistors that spend most of their time in 'sleep/energy saving' mode.

There is a difference between running just one program/thread at a time and having multiple, possibly limited, single-threaded apps running at the same time.

Every metric in the memory section is scalar and the 2.26 wins most of those too. The only loser is the stdlib Write (surprise, the "older" OS possibly doesn't leverage the "new" approach to CPU architecture/NUMA).


Every metric in the stream section is scalar and the 2.26 wins.

Ok, you want to compare multithreaded stuff, check cinebench. 2.8 wins over 2.26 on multicore rendering. Unless you are constantly going to run Geekbench on the new computer, 2.8 seems like a better deal, considering it's a lot cheaper and gives around the same performance on multi, and better performance on single.
 
Ok, you want to compare multithreaded stuff, check cinebench. 2.8 wins over 2.26 on multicore rendering. Unless you are constantly going to run Geekbench on the new computer, 2.8 seems like a better deal, considering it's a lot cheaper and gives around the same performance on multi, and better performance on single.

just two posts before yours is a graph that has:

2008 Octo 2.8 at 18,907 ( about a 5.8x speed up)
2009 Octo 2.26 at 20,138 ( about a 6.4x speed up)

Where are your numbers coming from?


Even out the 2.8 so that it has the same memory and hard drive space; I'm not sure it is going to be a lot cheaper percentage-wise. Or the difference is that between a "used" computer and a "new" one.

Also, since it sucks down that much less power over time, the 2.26 is cheaper to run.

I don't "want to do" multicore... but it begs the question of why pay thousands of dollars to sit around most of the day and serially run single-threaded, single-core programs all day. Right tool for the right job. The 2009 Quad 2.66 gets a better single-core score than the 2008 Octo 2.8 if you're just looking at single. A lower number of "wasted" cores, and it is faster if just sticking to one core! The late 2008 iMac is faster too if you narrow the problem space to just one core.
 
Errrrr.
most stuff right now if you only do one thing at a time. Not sure why folks are going to buy 8 core boxes to just run single threaded stuff on just a single core.

Because this is no longer the case, and hasn't been for 2 years. If I count numbers, then about 30% of my apps are indeed multithreaded. If I count usage, however, it's a different story: of the apps I use the most, about 90% are multi-threaded. The other 10% simply can't be; MT cannot be implemented for everything.

Of that 90%, only about half scale well. "Scaling well" to me would be indicated by seeing my procs (all 8, or all 16 in the case of the new machines) all at 50%. Several of my apps scale perfectly or almost perfectly - all procs at 100% during a heavy execution cycle. Many applications and application suites allow the user to set the number of threads they use. This is awesome because one can give 4 threads to something number-crunching in the background, 2 cores to whatever you're working on, etc. This kind of environment is very sweet to work in! It's like working on multiple machines and sharing all resources full time!

Still, this is very significant! Significant enough by any working man's standards to justify paying for the multiple procs.

Also, I guess when Snow Leopard is released most application MT profiling will scale up across the board. That's the rumor anyway. I guess we'll find out soon enough.

You're right in some ways though. If all you're doing is surfing the web, reading mail, and looking at a few images then you don't need a MC/MP machine!

.
 
The updated Graph is up!

Cool, looks like the 2.26ghz results are now "normal" when corrected for frequency on the single-threaded tests, was that an updated run or a different machine?

4074/2.93 = 1390 per ghz
3572/2.66 = 1343 per ghz
3142/2.26 = 1390 per ghz

However looking at geekbench results, the single threaded results of the 2.26 still look lower when corrected for frequency...
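
For anyone who wants to repeat the "corrected for frequency" check, a small Python snippet using just the single-thread scores quoted above:

    # single-thread score per GHz, using the updated numbers from the chart
    scores = {2.93: 4074, 2.66: 3572, 2.26: 3142}
    for ghz, score in sorted(scores.items()):
        print(ghz, "GHz ->", round(score / ghz), "points per GHz")
    # 2.26 GHz -> 1390, 2.66 GHz -> 1343, 2.93 GHz -> 1390 (roughly flat, as expected)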
 
Cool, looks like the 2.26ghz results are now "normal" when corrected for frequency on the single-threaded tests, was that an updated run or a different machine?

However looking at geekbench results, the single threaded results of the 2.26 still look lower when corrected for frequency...

What do you mean "corrected for freq." ??
It was posted by a different screen name, so I assume it was a different machine.

BTW, some people may not see the new graph without refreshing their browsers, as I'm using the same file name over and over on the host server. :D So if you're not seeing the goodness, just reload the page - it should update.
 
What do you mean "corrected for freq." ??

Divide the score by the processor speed to determine if the score scales linearly with clock. Your first graph had a 2.26ghz nehalem score which was anomalously low when compared to other nehalem scores...
 
That chart is really looking good! Yeah, the 2.26 Octo is about on par with the 2008 2.8 Octo, now.
 