Database Reliability

August 04, 2010

The database is usually a critical piece of infrastructure in an organization; when the database is down, many things stop working, so database reliability is often a serious concern. While the reliability of database software is important, for Postgres it is often the infrastructure that Postgres depends on that causes outages, not Postgres itself. We see this regularly on the Postgres email lists.

To get started, a fundamental assumption has to be discarded: that computers are abstract machines that always do what they are told. While we often treat hardware as an abstract device, in reality it is physical, and is susceptible to failure just like any physical entity.

So, what things can go wrong? First, consider disk drives. This Slashdot comment explains how non-atomic disk drives really are:

 

Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user. ... Then there's the fact that bits aren't even stored as bits really. ... They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave.

 

Yikes, that certainly makes me feel less confident about disk storage and its supposedly abstract, predictable behavior. Many people are concerned about disk drive failure, so they use inexpensive RAID controllers and believe their reliability issues are solved. Then there is the problem of disks acknowledging a write before the data is stored permanently (write-back caching); this is covered extensively in the Postgres manuals.
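To make the write-caching problem concrete, here is a minimal sketch in Python of what an application does when it wants a write to survive a crash. The helper name durable_write is hypothetical, and this is not how Postgres itself is implemented. Note that os.fsync() only asks the operating system to flush its buffers to the device. If the drive or controller has a volatile write-back cache and acknowledges the flush before the data is on stable storage, the data can still be lost on power failure, which is exactly the hardware issue described above.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to flush it to the storage device.

    Hypothetical helper for illustration only. It cannot detect a drive
    that acknowledges the flush while the data is still in a volatile cache.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        # os.write() may write fewer bytes than requested, so loop until done.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Without this call, the data may sit in the OS page cache
        # long after the write has "succeeded".
        os.fsync(fd)
    finally:
        os.close(fd)
```

A call such as durable_write("example.dat", b"committed") returns only after the operating system reports the flush as complete. Whether that report is truthful depends on the hardware underneath.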

RAM is also not an abstraction — it can fail too. This article does a great job of diagnosing an executable that suddenly stopped working due to a RAM error.

Finally, the motherboard can fail too. Last month there were extensive reports about Dell distributing faulty motherboards, and covering up the fact, with failure rates up to 97%!

All this information underscores how fragile hardware can be, and highlights that hardware reliability is a significant aspect of overall database reliability.

Update: Dell has settled the court case over covering up the shipment of defective motherboards. 2010-09-26

Update: More details on the Dell cover up. 2010-11-24
