A friendly little bird who was there gave this to me. Posted with all comments intact. She didn't understand everything they were talking about; maybe some of you will.
-----
The following are notes taken from a PowerPoint presentation given at 11 AM on 9/4/2003.
more info can be found at
http://www.computing.vt.edu/research_computing/terascale
"Terascale Computing Facility"
Opening remarks:
-An advisory committee has been appointed for governance of use. [Yay.]
Slide One
Computational Science and Engineering Institute[?]
Goals:
- to build a world-class facility
- to provide a high-performance network to tie in with computational grids
- connect supercomputers, visualization, and data storage
Slide Two
Goals and Scope
1. support research in computational science and engineering
2. dual usage: production & experimental (apps)
3. create beneficial collaboration [scribble scribble, I need to write things I can read]
Slide Three
TCF
- based on 64-bit architecture
- employs high bandwidth low latency communications fabric
- operational for production [apps?] in Fall 2003; fully operational by the end of the year
Slide Four
Choosing the Right Architecture
- cost vs. performance (purely)
- total cost $5.2 million includes system itself, memory, storage, and communication fabrics
- one of the cheapest systems of its kind
Slide Five
Architectural Options [or something like that]
- Dell - too expensive [one of the reasons the project was so "hush hush" was that Dell was exploring pricing options during bidding]
- Sun (SPARC) - required too many processors, also too expensive
- IBM/AMD (Opteron) - required twice the number of processors and was twice the price in the desired configuration; had no chassis available
- HP (Itanium) - ditto
- Apple (IBM PPC970) - system available with chassis for the lowest price
Slide Six
Nodes
- dual PPC970 2GHz
- Each node has:
 - 4 GB RAM
 - 160 GB Serial ATA storage
- 176 TB total secondary storage across the system [see the arithmetic note below]
- 4 head nodes
- 1 management node
- most powerful "homebuilt" supercomputer in the world
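The node count doesn't appear in these notes, but it falls out of the storage figures: 176 TB total / 160 GB per node = 1,100 nodes, i.e. 2,200 PPC970 processors at two per node.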
Slide Seven
Reliability
- commodity clusters have issues due to the large number of units
- VT developed a transparent fault-tolerance system called "Deja Vu"
- collaborated with PSC [Pittsburgh Supercomputing Center]
- can recover from just about every failure, e.g. someone hits the wrong switch, the OS crashes, things fail in general, power loss, etc.
- this system has been ported to the G5 and will be deployed in the TCF [rough sketch of the general idea below]
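The slides didn't say how Deja Vu works internally. For context only, here is what manual, application-level checkpoint/restart looks like in plain C, the kind of bookkeeping a transparent fault-tolerance layer is supposed to take off the programmer's hands; the file name and state layout here are invented for illustration.

    #include <stdio.h>

    /* Illustration only: manual application-level checkpoint/restart.
       A transparent fault-tolerance layer is meant to do this kind of
       save-and-resume automatically, without changing application code. */

    #define CKPT_FILE "state.ckpt"   /* invented file name */

    static long load_checkpoint(void)
    {
        long i = 0;
        FILE *f = fopen(CKPT_FILE, "r");
        if (f) {                         /* resume from the last checkpoint */
            fscanf(f, "%ld", &i);
            fclose(f);
        }
        return i;                        /* 0 if no checkpoint exists yet */
    }

    static void save_checkpoint(long i)
    {
        FILE *f = fopen(CKPT_FILE, "w");
        if (f) {
            fprintf(f, "%ld\n", i);
            fclose(f);
        }
    }

    int main(void)
    {
        long i = load_checkpoint();      /* pick up where a crashed run stopped */
        for (; i < 1000000; i++) {
            /* ... one unit of real work would go here ... */
            if (i % 10000 == 0)
                save_checkpoint(i);      /* persist progress periodically */
        }
        save_checkpoint(i);
        return 0;
    }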
Slide Eight
Primary Com[munications?] Architecture
- working with Mellanox for InfiniBand solutions
- the system is [obviously] based on InfiniBand technology
- fully switched network, 20 Gbps full duplex
- 24 96-port switches in a "fat-tree" topology
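A rough sense of what a fat tree buys you (textbook figure, not from the slide): an idealized two-level fat tree built from radix-k switches keeps full bisection bandwidth for up to k * k / 2 hosts, which for 96-port switches works out to 96 * 96 / 2 = 4,608; how the 24 switches here are actually cabled, and whether full bisection is the goal, wasn't described.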
Slide Nine
Secondary Com[munications?] Architecture
- Gigabit Ethernet management backplane
Slide Ten
National Lambda Rail (nationwide optical network)
- all networking equipment [at least for this locale] is Cisco
- the following organizations are involved with NLR:
-CENIC
-Cisco
-Duke
-Florida Consortium
-Georgia Tech
-Internet2
-MATP
-PNWGP [Pacific Northwest Gigapop]
-Texas [I imagine a university, not the whole state. "Yes, Texas backs National Lambda Rail, yessir."]
- additional player: PSC [Pittsburgh Supercomputing Center; yeah, I don't know why either]
- VT leads Washington DC point of presence
- DC node goes active in the first half of 2004
Slide Eleven [maybe]
Software
- Mac OS X
- Why not Linux? Not enough support.
- Mellanox provides the InfiniBand drivers and HCAs
- MPI (parallel communications libraries) [minimal example after this list]
- working with Argonne National Labs to get MPI-2 for the system
- C, C++ compilers - IBM xlc and gcc 3.3
- Fortran 95/90/77 compilers - IBM xlf and NAGWare
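Since the slide lists MPI alongside the xlc/gcc toolchain, a minimal MPI program is sketched below for reference. It is generic MPI, nothing specific to this system, and the compiler wrapper and launcher names (mpicc, mpirun) are assumptions that depend on which MPI implementation actually ships.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI program: every process reports its rank and the total
       number of processes in the job. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Typically built and launched with something like "mpicc hello.c -o hello" and "mpirun -np 4 ./hello", though the exact commands depend on the MPI installation.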
Slide Twelve [I should give up soon]
Sustainability Model
(organizations that could make use of the facility or have already expressed interest)
- Federal organizations
-NSF Cyberinfrastructure Program
-NIH, DOE, DARPA, DoD, AFOSR [Air Force Office of Scientific Research], ONR [Office of Naval Research]
- Industry (the system can attract industrial interest)
- External Research Partners
- National Labs, Supercomputer Centers, NASA, NIA
Slide Thirteen [I never learn]
Access
- internal access not based solely on research funding contributed
- priorities might be established based on contribution at a later time
- provide easy access for investigators [I missed the end of this line]
- external access determined on a cost recovery basis
Slide Fourteen
Future
- Computational Science and Engineering is a long-term project
- Current facility will be followed with a second in 2006
Slide Fifteen
Timeline
- Oct. 1st - preliminary operations
- Oct. 1st - Mid Nov. - performance optimization and benchmarking
- Mid Nov. - available for initial apps ("hero users" [heh, i.e. the poor suckers who test out the initial config])
- available to any user with operating MPI coverage [huh?]
- Jan 2004 - fully operational
PART II of presentation
-insert lots of information here
VT Op. Center will be staffed 24-7
Facility
- 3 MW power, double redundant with backups - UPS and diesel
- 1.5 MW reserved for the TCF
- 2+ million BTU [per hour, presumably] of cooling capacity using Liebert's extreme-density cooling (rack-mounted cooling via liquid refrigerant) [conversion note below]
- traditional methods [fans] would have produced wind speeds of 60+ MPH
--insert photos of cool stuff here- facilities, a G5, cooling rack set up-
Racks were custom designed for this system
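For scale (a standard unit conversion, not from the slide): 2 million BTU per hour is about 2,000,000 x 0.293 W, roughly 590 kW of continuous heat removal.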
Usage
- dual usage - production and experimental
- experimental operations will not interfere with production use
CSE Research Avenues
- nanoelectronics
- quantum chemistry
- computational chemistry/biochemistry
- fluid dynamics
etc, etc, etc
PART III
Services [Offered at the Facility]
- code development assistance
- general support
- grant writing support
[missed an item here]
Code Development Assistance
- HCSE
- Code kitchens w/apple and others
- FDI - expanding research tracks
Housed in AISB Machine Room
- Sysadmins will be available 24/7/365
- Tiered support
- Grant writing support
- Marketing and Business Rev[something.. enue?]
Official presentation ended here, begin Q and A!
How long would it take to parallelize the code?
- Highly application dependent
- can be done in an afternoon or take weeks/months
- duties of the management node can be covered by any of the four head nodes
- The system is a distributed-memory machine, not a shared-memory machine [see the sketch just below]
If two applications are battling, there is no interference in terms of communication
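To make the distributed-memory point concrete: data moves between processes only through explicit messages, because ranks on different nodes share no address space. A minimal, generic MPI sketch (needs at least two ranks; nothing here is specific to this system):

    #include <mpi.h>
    #include <stdio.h>

    /* Distributed memory: rank 0 must explicitly send its data to rank 1;
       there is no shared address space between processes on different nodes. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
        MPI_Finalize();
        return 0;
    }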
What kind of physical security for the facility?
-access to the building via keycard
-access to the machine room determined via biometrics
Online material concerning the project? Not at this time; probably up later today or late tonight
Does it render? Yes, incidentally, it does. The units came with high end graphics cards
- Plans to connect with The CAVE? Maybe.
Source of funding for the project? From different colleges within the university.
Are there limitations on what kinds of things the system can be used for? Not yet fully considered, but no web servers and no hosting an IRC network
Why so secret? Project started back in February; secret with Dell because of the pricing issues; dealt with vendors individually because bidding wars do not drive the prices down in this case
Deja Vu does not do load balancing.