AidenShaw said:
...I bet that you're right that there are a bunch of angry scientists. Angry that the university spent millions on a PR "proof of concept" that was unusable from the get-go...

The deadline last year wasn't just for the Top500 and it wasn't just for PR (although the need for PR is a fact of life too). It was a funding deadline for the National Science Foundation (their Cyber Infrastructure program).

The company VT FIRST wanted to use--Dell--was unable to offer sufficient power for the available budget, even though Dell was offering a special pricing deal just for VT. Only Apple was able to offer sufficient power for a low enough cost and on time to get the NSF funding--even though Apple charged VT full education price.

(But Apple clearly WAS willing to cut a Dell-style deal the second time around. The question I've never seen answered is whether an Xserve deal was worked out with Apple from the very start.)

And yes, silly though it may seem, benchmarks and meeting a certain date were what the funding hinged on. VT wasn't in control of that system. I'm sure they'd rather have waited for Xserve G5s, but those simply didn't exist yet.

Here's an article explaining some details, and touching on VT's "Deja Vu" fault-tolerance software.

http://www.unirel.vt.edu/vtmag/winter04/feature1.html

"We believed that we could build a very high performance machine for a fifth to a tenth of what supercomputers now cost, and we did."
 
MacinDoc said:
(Sorry for the double post - didn't know how to add a second quote with your link when I tried to edit my last post.)
For future reference, here are two techniques I've used:

1. To avoid having to retype the text you want to quote, control-click on the Quote button for the first post, and pick Open Link in New Window. Select the content and copy it to the Clipboard. Do not click the Submit Reply button! Close that window to return to the thread. Click Quote for the second post, and paste from the Clipboard at the top, above the second quote. Then add your comments.

2. More simply, you can type the quote markup yourself and cut and paste the content from the thread. The syntax is [QUOTE=membername]blah blah blah[/QUOTE], which shows up as:

membername said:
blah blah blah

Just be sure to quote correctly and attribute it to the right member.
 
The 2.3 GHz Xserves make sense to me. As we all know, there's a 500 MHz gap between the top-end and mid-range Power Macs. So I've been wondering for quite a while what happens to all those 970FX chips rated between 2.1 and 2.4 GHz. This seems to be the answer.
 
So I've been wondering for quite a while what happens to all those 970FX chips rated between 2.1 and 2.4 GHz. This seems to be the answer.

VT uses just 2,200 CPUs, so this can't be the answer.
 
Time & Space

MacBytes said:
Category: Apple Hardware
Link: Virginia Tech G5 supercomputer upgraded to 2.3GHz Xserves
Posted on MacBytes.com

Approved by Mudbug


When do these boys ever do any work on their machine? They always seem to be upgrading it: from Power Macs to Xserves to faster Xserves. I suppose with the physical space saved by the first upgrade they had enough room to install the 2.3 GHz Xserves with the old cluster still in place. Where do they get the money and time?

Sanj
 
One upgrade or two?

But did they REALLY ever have 2.0 GHz Xserves? I thought the racks were sitting empty for quite some time after the Power Macs were gone. And I never saw any news of "Xserve update complete" until several weeks ago--with 2.3s.

Someone at AI said they called VT and were told a complete Xserve upgrade was done in January. But most things I read still suggest that Big Mac has only been upgraded once, and that January was just when they announced their plan.
 
cc bcc said:
VT uses just 2,200 CPUs, so this can't be the answer.

Okay, then it's part of the answer. However, we know that supplies of 2.5 GHz chips are still pretty limited, so we don't know how many 2.0+ GHz CPUs IBM can ship to Apple. Perhaps VA Tech is not the only customer getting 2.3 GHz Xserves.
 
This makes me want to see what they can do when System X is upgraded in capacity, and not just to current hardware. Virginia Tech's Terascale Computing Facility is already working out plans to boost the overall size of the cluster from current levels. The current system is known as System X, but its derivatives will be called System C and System L, in keeping with the Roman numerals.

Go on, do the math...

They're talking about a five- and tenfold increase in the number of servers, to 5,500 and 11,000 boxes. If COLSA is any indication, that could lead to an enormous boost in overall power, since the Mach5 cluster is a bit less than twice as fast as System X with only 400-500 more machines.
 
IBM rules supercomputing / superclusters

I agree with thatwendigo that it'll be interesting to see how fast Mac clusters can become with 5x and 10x the nodes of the current System X.

However, the forthcoming IBM Blue Gene/L system will be just sick.

The New York Times said:
A large-capacity version of the Blue Gene/L system is scheduled to be installed early next year at the Lawrence Livermore National Laboratory in Livermore, Calif. That machine will have about 130,000 processors, compared with the 16,000-processor prototype that set the speed record.

In other words, IBM just surpassed NEC's Earth Simulator with 16,000 processors at 36 teraflops. IBM's forthcoming 130,000-processor machine will be amazing. 8.125x the processors makes for 292+ teraflops if it scales linearly, which of course it likely won't. Even at 200+ teraflops, wow.
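
(For what it's worth, here's that back-of-the-envelope math as a quick Python sketch. The 16,000-processor / 36-teraflop and ~130,000-processor figures come from the NYT piece quoted above; the linear-scaling assumption is mine and is deliberately optimistic.)

# Linear-scaling estimate for the full Blue Gene/L (optimistic by design).
prototype_procs = 16000.0    # record-setting configuration
prototype_tflops = 36.0
full_procs = 130000          # "about 130,000 processors" per the NYT

scale = full_procs / prototype_procs         # ~8.125x the processors
linear_estimate = prototype_tflops * scale   # ~292.5 TFLOPS if it scaled perfectly
print("%.3fx the processors -> %.1f TFLOPS (linear upper bound)" % (scale, linear_estimate))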

Put another way, IBM's "pro" line of supercomputers is Blue Gene/L, and the "consumer" line is PPC970fx / Apple Xserve G5. Blue Gene/L is haute couture and Xserve is off the rack (no pun intended).
 
Rod Rod said:
Put another way, IBM's "pro" line of supercomputers is Blue Gene/L, and the "consumer" line is PPC970fx / Apple Xserve G5. Blue Gene/L is haute couture and Xserve is off the rack (no pun intended).
Sure, but how much does it cost (both to build and to operate)? Let's see lists of the top purchase and operational price-per-performance ratios. I bet Apple would dominate there, even based on the current, relatively primitive G5 core (compared to a POWER5-derived core, which we'll probably see late next year).
 
Rod Rod said:
In other words, IBM just surpassed NEC's Earth Simulator with 16,000 processors at 36 teraflops. IBM's forthcoming 130,000-processor machine will be amazing. 8.125x the processors makes for 292+ teraflops if it scales linearly, which of course it likely won't. Even at 200+ teraflops, wow.

The COLSA Mach5 cluster is hitting about 25 teraflops with 1,566 nodes, at a cost of roughly $7 million for the entire installation process. The Earth Simulator hits 35.86 teraflops with 5,120 processors and a cost of over $350 million. Blue Gene/L doesn't have a stated cost yet, but assuming 50% scalability from its 11.68-teraflop, 8,192-processor performance level, you'd get 93.44 teraflops at 131,072 processors. If it were to scale at 75%, you'd see 140.16 teraflops.

By contrast, System X got 10.28 teraflops at a cost of $5 million and using the previous generation of xServes. If you expand that by a factor of five, you'd see a performance boost to 25.7 teraflops at 50% scalability, or 38.55 teraflops at 75% scaling. At that step, the cluster would cost something like $25 million, if one assumes that all expenses had to be incurred again (which likely isn't the case). This also hammers the hell out of the LLNL Tiger4 Itanium cluster, which costs $20 million and scores less than 20 teraflops. Take it another iteration to 11,000 nodes and you get a theoretical 51.4 teraflops at 50% scaling, or 77.1 teraflops at 75% scaling, at a cost of $50 million (with the same assumption about expanding costs, which is still unlikely to be true).

Mathematically:
System X (1,100 nodes) - 10.28/5 = 2.056 teraflops per million spent
COLSA Mach5 (1,566 nodes) - 25/7 = 3.57 teraflops per million spent
NEC Earth Simulator (5,120 CPUs) - 35.86/350+ = 0.102 teraflops per million spent
Blue Gene/L prototype (8,192 CPUs) - 11.68/5 = 2.336 teraflops per million spent (assuming the generously low $5 million price tag)
Lawrence Livermore Tiger4 Itanium2 (4,096 CPUs) - 19.94/20 = 0.997 teraflops per million spent

Are we noticing a trend here on bang for buck?
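
(If anyone wants to check or tweak those ratios, here's the same table as a few lines of Python. The costs are the loose figures floating around this thread, some of which get disputed a few posts down, not audited budgets.)

# Bang-for-the-buck from the rough numbers above: TFLOPS per million dollars.
systems = [
    ("System X (1,100 nodes)",            10.28,   5.0),
    ("COLSA Mach5 (1,566 nodes)",         25.0,    7.0),
    ("NEC Earth Simulator (5,120 CPUs)",  35.86, 350.0),
    ("Blue Gene/L prototype (8,192 CPUs)", 11.68,  5.0),   # $5M is a guess
    ("LLNL Tiger4 Itanium2 (4,096 CPUs)", 19.94,  20.0),
]

for name, tflops, cost_millions in systems:
    print("%-36s %.3f TFLOPS per $1M" % (name, tflops / cost_millions))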
 
looking at one right now...

shamino said:
It takes more than a processor upgrade to turn a PowerMac into an Xserve. I'm pretty sure they got more than new CPU chips :)
VT sold their G5 PowerMac cluster nodes. They were available through (I think) MacMall for a few months. So they didn't lose their investment. IIRC, they actually sold for close to Apple's original price - people were willing to pay that much because the computers were virtually new and included some certificate of authenticity stating that the computer was a VT cluster node. (Yes, VT didn't get the full purchase price, since the store took a cut, but VT also didn't pay full retail price - they paid Apple's educational institution price.)

Yeah, my boss bought one for the office as a refurb (MacWarehouse, maybe?). He kept all the packaging but intentionally threw out the certificate. I have no idea why...

Anyway, it has that boisterous fan issue, and it's sitting on an adjacent desk about three feet away. LOUD.
 
what work did it do, then?

EminenceGrise said:
That's patent nonsense. The original cluster was not "so unstable that it was unusable for doing any real work" by any stretch.


Sorry, but I've talked to some of the engineers from Mellanox, and they've described the problems with stability. It's no secret inside the IB community that the first cluster was never put into production use.

Also, your descriptions of "voting" schemes for error detection are really simplistic. You seem to assume that any memory error will show up as an error in a calculation.

That's not at all the case. Some will show up as math errors, others will manifest themselves as kernel panics or program crashes. Some will corrupt files or send bad data over the network, and the really nasty ones will cause wrong branches to be taken - so that almost anything could happen as the program (or OS) goes screaming down the wrong code path.

All the descriptions of Déjà vu (like http://www.apple.com/education/science/profiles/vatech/scheduling.html and http://www.vtnews.vt.edu/Archives/2004/June/04263.htm) describe it as "transparent parallel checkpointing and recovery" - in other words it deals with system failures and restarts a job. Nothing is ever mentioned about voting schemes and "software ECC" - nothing.
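
(For anyone who hasn't run into the term, here's a minimal, generic sketch of what checkpoint-and-restart means. This is not VT's actual Deja Vu code, just the idea of periodically saving state so a failed run can resume from its last checkpoint instead of starting over.)

# Generic checkpoint/restart, NOT Deja Vu itself: save progress periodically
# so a crashed job resumes from its last checkpoint rather than from scratch.
import os, pickle

CHECKPOINT = "job.ckpt"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "partial_result": 0.0}

def save_state(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)   # atomic swap so a crash can't leave a half-written file

state = load_state()
for step in range(state["step"], 1000000):
    state["partial_result"] += step * 1e-6    # stand-in for real work
    if step % 10000 == 0:
        save_state({"step": step + 1, "partial_result": state["partial_result"]})

The point being: this style of fault tolerance recovers from crashes and dead nodes, but it does nothing about a run that keeps going and silently produces wrong numbers.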

Remember that the VAtech cluster was dismantled and sold almost as soon as the XServe G5 was announced.

It would have been quite easy to keep the cluster running and replace the Power Macs with Xserves rack by rack - little or no downtime would have been needed.

Instead, the cluster was completely dismantled and has been offline for the last nine months.

Can you explain why they would have taken a "usable" cluster offline for nine months? I can't explain that - and therefore find it much easier to believe the stories from the Mellanox guys that it was unstable to the point of unusable.

They would not have pulled the plug on a working multi-million dollar machine.

They would not have suffered the embarrassment of falling completely off the Top500 list unless there were real problems with usability.
 
apples to apples, please

thatwendigo said:
The COLSA Mach5 cluster is hitting about 25 teraflops with 1,566 nodes, at a cost of roughly $7 million for the entire installation process. The Earth Simulator hits 35.86 teraflops with 5,120 processors and a cost of over $350 million.

By contrast, System X got 10.28 teraflops at a cost of $5 million and using the previous generation of xServes.


The $/FLOP comparisons are mostly nonsense, because different methods of calculating the cost are used for each system.

The Earth Simulator price covers the entire building, including special seismic bracing. It's a "bare lot to supercomputer" pricetag.

The Virginia cluster $5.2M figure doesn't include things like the air conditioning and power - it leaves out much of the expenses that are included with other systems.

So, yes it was a bargain - but calculating $/FLOP isn't that meaningful when different methods of arriving at the cost are used.

Mathematically:
486DX2 66MHz ($1 on eBay) (1) - 0.00000240/0.000001 = 2.4 teraflop per million spent ;)


(And, BTW, it used Power Mac G5 systems, not an older XServe.)
 
AidenShaw said:
The Virginia cluster $5.2M figure doesn't include things like the air conditioning and power - it leaves out much of the expenses that are included with other systems.

Source, please. As I remember reading it, the price tag includes the HVAC and other considerations for the cluster.

(And, BTW, it used Power Mac G5 systems, not an older XServe.)

Oh, yeah. That's what I get for posting early. :eek:
 
no problem

thatwendigo said:
Source, please. As I remember reading it, the price tag includes the HVAC and other considerations for the cluster.

http://www.computerworld.com/softwaretopics/os/macos/story/0,10801,86704p3,00.html

The system costs $5.2 million for the G5s, racks, cables, and Infiniband cards.

Virginia Tech spent an additional $2 million on facilities, $1 million for the air conditioning system and $1 million for the UPS and generator.

So, $9.2M (or maybe $7.2M, depending on whether they meant "facilities = air + power" rather than the "facilities + air + power" they wrote) - and we haven't factored in building and real estate costs, the "slave labor," and other things unique to the VA Tech cluster.

Another article ( http://www.nwfusion.com/news/2004/0531environmental.html ): "Shinpaugh says the school spent about $2 million for the cooling devices and adding power."


So, it was a bargain, but please don't create tables showing $/FLOP :mad: .


Editor's note: An obvious typo was corrected in the first quote.
 
thatwendigo said:
The COLSA Mach5 cluster is hitting about 25 teraflops with 1,566 nodes, at a cost of roughly $7 million for the entire installation process. The Earth Simulator hits 35.86 teraflops with 5,120 processors and a cost of over $350 million. Blue Gene/L doesn't have a stated cost yet, but assuming 50% scalability from its 11.68-teraflop, 8,192-processor performance level, you'd get 93.44 teraflops at 131,072 processors. If it were to scale at 75%, you'd see 140.16 teraflops.

I don't know why you're extrapolating performance from the 8,192 processor prototype. Really, please read the New York Times article. You don't even have to register at nytimes.com because c|net has the article too.

It makes much more sense to begin your extrapolation from the 16,000 processor system rather than the 8,192 processor one.

From the article it's not clear what type of processor is in IBM's 16,000-processor, 36-Tflops system, but look: It's roughly double the processors and TRIPLE the Tflops compared to the prototype.


The New York Times said:
The new system is notable because it packs its computing power much more densely than other large-scale computing systems. Blue Gene/L is one-hundredth the physical size of the Earth Simulator and consumes one twenty-eighth the power per computation, the company said.

HiRez, my bets are on the Blue Gene/L for lower costs. 28x the power efficiency and 1/100th the size should make for some savings, sort of like comparing vacuum tubes with transistors.
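
(To put numbers on "start from the 16,000-processor system," here's the same kind of rough extrapolation done from the record-setting configuration instead of the prototype. The scaling efficiencies are just assumptions for illustration.)

# Extrapolate Blue Gene/L from the 16,000-processor, 36 TFLOPS record run.
# The efficiency factors are assumptions, not measurements.
base_procs, base_tflops = 16000.0, 36.0
full_procs = 130000

for efficiency in (0.5, 0.75, 1.0):
    estimate = base_tflops * (full_procs / base_procs) * efficiency
    print("at %.0f%% scaling efficiency: ~%.0f TFLOPS" % (efficiency * 100, estimate))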
 
lies, damned lies, and benchmarks

Rod Rod said:
HiRez, my bets are on the Blue Gene/L for lower costs. 28x the power efficiency and 1/100th the size should make for some savings, sort of like comparing vacuum tubes with transistors.

Amdahl's Law would favor the Earth Simulator - since its processors are 30 times faster than Blue Gene/L processors.

The LINPACKD benchmark parallelizes extremely well - it can be carved into as many pieces running in parallel as you want.

For applications that aren't as easily parallelized - say, those that can't be cut into more than a few hundred jobs - adding *more* small processors won't help; you want *faster* processors.

Just be careful extrapolating too much from one particular (and some say "peculiar") benchmark like LINPACKD.
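
(For reference, Amdahl's Law in a few lines of Python. The parallel fractions below are made up purely to show the shape of the curve, not measurements of any real code.)

# Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
# where p is the parallelizable fraction and n the processor count.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / float(n))

# Made-up fractions; n roughly matches Earth Simulator vs. full Blue Gene/L.
for p in (0.90, 0.99, 0.999):
    for n in (5120, 131072):
        print("p=%.3f, n=%6d: speedup ~%.0fx" % (p, n, amdahl_speedup(p, n)))

No matter how many slow processors you add, the speedup is capped at 1/(1 - p), which is why per-processor speed still matters for codes that don't parallelize cleanly.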
 
Rod Rod said:
HiRez, my bets are on the Blue Gene/L for lower costs. 28x the power efficiency and 1/100th the size should make for some savings, sort of like comparing vacuum tubes with transistors.
You could be right. Even though IBM could be considered a competitor to Apple in this area, I'm rooting for IBM and any POWER-based success, because that success will eventually trickle down to the Mac. And not just to Xserve-based superclusters - it should eventually benefit my lowly desktop Power Mac too :)
 
Totally missing the point

HiRez said:
Sure, but how much does it cost (both to build and to operate)? Let's see lists of the top purchase and operational price-per-performance ratios. I bet Apple would dominate there, even based on the current, relatively primitive G5 core (compared to a POWER5-derived core, which we'll probably see late next year).

BlueGene is a totally superior architecture from top to bottom, and scalable in far more ways than a cluster. A cluster is NOT a supercomputer, and there are a lot of applications that will drag ass on a cluster (worse performance than a medium SSI box, even) but will scream on a genuine supercomputer. Apples and oranges. Clusters are very limited in their application; true supercomputers generally are not.

At the end of the day, for most apps the specific core matters less than other architectural details such as the memory and I/O architecture. The POWER series processors aren't better than the G5 because they have a radically superior core, but because they have superior memory and I/O architectures. What will be interesting to see is whether IBM eventually merges their POWER series into their PPC series. The primary reason the Opterons are a little faster than G5s (assuming the PathScale and IBM high-performance compilers, respectively), all things generally being equal, is that the Opteron has some of the features that IBM reserves for its POWER series processors. As it stands, the biggest limitation of the PPC processors is their markedly inferior memory subsystem compared to AMD's and arguably even Intel's. IBM knows how to make good memory subsystems; they just didn't put one on the PPC.
 
LINPACK TFLOPS aren't general purpose TFLOPS

thatwendigo said:
This also hammers the hell out of the LLNL Tiger4 Itanium cluster, which costs $20 million and scores less than 20 teraflops.

LINPACK teraflops aren't real teraflops (more like DSP TFLOPS really), and so for many codes $20M worth of Itanium will soundly beat $20M worth of G5 -- Itanium is the king of general purpose floating point, and the memory subsystem is decent. (Actually, with the best compilers available for both architectures, Opterons have slightly better general purpose floating point performance than G5s.) LINPACK measures DSP-like floating point loads, a narrow application space that PPC does extremely well at. Apparently LLNL has one of the many, many supercomputing applications for which LINPACK is a ridiculously uncharacteristic benchmark. Many people do.

My supercomputing applications are similarly uncharacteristic, but in a different fashion, being memory intensive mixed loads. The Opterons thrash the G5s for these codes, and so we use big Opterons for supercomputing purposes. I still use Apple systems for most everything else though.

LINPACK TFLOPS are pretty marginal as a benchmark, and you'll find that the reason so many clusters are still built on other architectures is that other architectures often give superior bang-for-the-buck for the types of applications those clusters are being built for.
 
tortoise said:
LINPACK teraflops aren't real teraflops (more like DSP TFLOPS really), and so for many codes $20M worth of Itanium will soundly beat $20M worth of G5 -- Itanium is the king of general purpose floating point, and the memory subsystem is decent. (Actually, with the best compilers available for both architectures, Opterons have slightly better general purpose floating point performance than G5s.) LINPACK measures DSP-like floating point loads, a narrow application space that PPC does extremely well at. Apparently LLNL has one of the many, many supercomputing applications for which LINPACK is a ridiculously uncharacteristic benchmark. Many people do.

My supercomputing applications are similarly uncharacteristic, but in a different fashion, being memory intensive mixed loads. The Opterons thrash the G5s for these codes, and so we use big Opterons for supercomputing purposes. I still use Apple systems for most everything else though.

LINPACK TFLOPS are pretty marginal as a benchmark, and you'll find that the reason so many clusters are still built on other architectures is that other architectures often give superior bang-for-the-buck for the types of applications those clusters are being built for.
That explains why VA Tech went with PPC G5s (and others went with something else) - the chosen architecture performs better on the specific task(s) the creators of the cluster had in mind. I'm a little surprised to learn that LINPACK FLOPS are not applicable to the real world in most cases.
 
one other reason PPC gets unrealistic LINPACK numbers

tortoise said:
LINPACK TFLOPS are pretty marginal as a benchmark, and you'll find that the reason so many clusters are still built on other architectures is that other architectures often give superior bang-for-the-buck for the types of applications those clusters are being built for.

Another factor that makes the LINPACK results for PPC (and POWER) better than real-life performance is that the PPC ISA has a MADD instruction - a floating point MULTIPLY and ADD in one instruction.

To quote IBM:

The total cluster will have a theoretical peak capacity of over 40 trillion floating-point operations per second. The following table explains how theoretical peak capacity is calculated.

[table image from IBM's article: how theoretical peak capacity is calculated]


In practice, only a small portion of peak capacity is achieved because a processor is rarely scheduled to do simultaneous “multiply and adds” in double precision.

However, the LINPACK benchmark, which is often used to rank supercomputers (the Top500 Supercomputer list), makes extensive use of simultaneous multiply and add. Massively parallel systems regularly achieve between 50 and 90 percent of peak performance.

http://www-306.ibm.com/chips/products/powerpc/newsletter/jun2004/ppc_process_at_work.html
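
(As a concrete example of that theoretical-peak arithmetic, here's the usual calculation for a G5 cluster. The per-cycle numbers assume the commonly cited figure of two FPUs per 970FX, each retiring one fused multiply-add, i.e. 2 FLOPs, per cycle.)

# Theoretical peak for a G5 cluster, assuming 2 FPUs per CPU and one
# fused multiply-add (2 FLOPs) per FPU per cycle = 4 FLOPs/cycle/CPU.
def peak_tflops(cpus, ghz, flops_per_cycle=4):
    return cpus * ghz * 1e9 * flops_per_cycle / 1e12

print(peak_tflops(2200, 2.0))    # original System X: ~17.6 TFLOPS peak
print(peak_tflops(2200, 2.3))    # 2.3 GHz Xserve version: ~20.2 TFLOPS peak

System X's measured LINPACK number was 10.28 teraflops, i.e. roughly 58% of that 17.6 TFLOPS peak - right in the 50-to-90-percent band IBM mentions.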


There's an article discussing other LINPACK issues and a new supercomputer ranking system at Supercomputer ranking method faces revision.
 
totally superior - not!

tortoise said:
BlueGene is a totally superior architecture from top to bottom, and scalable in far more ways than a cluster.


This quote doesn't make it sound that "totally superior" at all:

Don Dossa, a computational physicist at the Livermore lab, said in an earlier interview that Blue Gene/L is well adapted for some calculations but not for others.

For example, it's appropriate for simulating materials in a relatively unchanging environment, such as a solid that's cracking under pressure. But when it comes to the nuclear weapons simulations at the heart of the lab's mission, the 512MB of memory in each Blue Gene/L computing node isn't enough to deal with the broad range of radiation, temperature and pressure conditions, Dossa said.

For those nuclear weapons simulations, IBM is building a system called ASCI Purple that uses a smaller number of more powerful computers that rely on IBM's Power5 processor.

http://news.com.com/IBM+claims+fastest+supercomputer+title--for+now/2100-7337_3-5388797.html

So, IBM is building a cluster with 12,544 POWER5 CPUs to do jobs that Blue Gene/L isn't suitable for....
 
AidenShaw said:
This quote doesn't make it sound that "totally superior" at all


Actually, I agree, and was mixing points. BlueGene is not really a general purpose supercomputer, though it does share some of the features of one. It is more of a hybrid architecture that splits the difference but is still optimized for a particular subset of applications.

One could argue that the determinant of what is and isn't a general-purpose supercomputer has less to do with how many processors it has and a lot more to do with memory architecture. Not having a single massive address space limits the applications you can run effectively on a supercomputer. Something like an SGI Altix system is more along the lines of what I was thinking of. I think there is a trend toward more specialized architectures and applications in supercomputing, which is giving us things like BlueGene and OctigaBay (now the Cray XD1), with application-optimized interconnects, FPGAs, etc.
 