
mkrishnan

Moderator emeritus
Original poster
Jan 9, 2004
29,776
15
Grand Rapids, MI, USA
http://news.cnet.com/8301-30685_3-10370026-264.html?tag=newsLeadStoriesArea.1

"We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported," according to the paper jointly written by Bianca Schroeder, a professor at the University of Toronto, and Google's Eduardo Pinheiro and Wolf-Dietrich Weber. "Memory errors are not rare events."

How many errors? On average, about one in three Google servers experienced a correctable memory error each year and one in a hundred an uncorrectable error, an event that typically causes a crash.

That may not sound like a high fraction, but bear these factors in mind, too: each memory module experienced an average of nearly 4,000 correctable errors per year, and unlike your PC, Google servers use error correction code (ECC) that can nip most of those problems in the bud. That means a correctable error on a Google machine likely is an uncorrectable error on your computer, said Peter Glaskowsky, an analyst at the Envisioneering Group (and member of CNET's blog network).

As with their study on hard drive failures and operating temperature (which showed that temperature was less of a factor in drive failure than generally thought), it's great to see Google contributing some interesting information on real-world hardware performance and durability from its own operations.
 

John Jacob

macrumors 6502a
Feb 11, 2003
548
9
Columbia, MD
I had read the article, and I found it interesting. One of my choices of topic for my master's thesis (not the one I finally took up, though) was on linker-level software techniques for detecting and recovering from transient errors (bit flips due to cosmic rays). The Google study shows that transient errors are not as common as thought.
 

mkrishnan

Moderator emeritus
Original poster
Jan 9, 2004
29,776
15
Grand Rapids, MI, USA
Yeah, does make you wonder a little bit about the move away from ECC in the consumer / non-server world. Did your final thesis topic have anything else to do with managing hardware anomalies, or was it some other software-side project?
 