As companies become more and more dependent on their information systems simply to function, the availability of those systems becomes more and more important. Outages can cost millions of dollars an hour in lost revenue, to say nothing of the potential damage to a company's image. To add to the problem, a number of natural disasters have shown that even the best data center designs can't withstand a tsunami, prompting many companies to implement or re-evaluate their disaster recovery plans and systems. Practically every customer I talk to asks about disaster recovery (DR) and how to configure their systems to maximize availability and support DR. This series of articles will cover some of the information I share with these customers.
The first thing to do is define availability and how it is measured. The definition I prefer is this: availability represents the percentage of time a system is able to correctly process requests within an acceptable time period during its normal operating period. I like this definition because it allows for times when a system isn't expected to be available, such as during evening hours or a maintenance window. That being said, more and more systems are expected to be available 24x7, especially as more and more businesses operate globally and there are no common evening hours.
Measuring availability is pretty easy. Simply put, it is the ratio of the time a system is available to the time the system should be available. I know, not rocket science. While it's good to measure availability, it's usually better to be able to predict availability for a given system, to determine whether it will meet a company's availability requirements. To predict availability, one needs to know a few things, or at least have good guesses for them. The first is the mean time between failures, or MTBF. For single components like a disk drive, these numbers are pretty well known; for a large computer system the computation gets much more difficult. More on the MTBF of complex systems later. The next thing one needs to know is the mean time to repair, or MTTR, which is simply how long it takes to put the system back into working order.
Obviously, the higher the MTBF of a system, the higher its availability, and the lower the MTTR, the higher its availability. In mathematical terms, the availability of a system in percent is:

Availability (%) = MTBF / (MTBF + MTTR) × 100
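Just to make the arithmetic concrete, here is a minimal sketch in Python of both calculations described above, the measured ratio and the MTBF/MTTR prediction. The function and variable names are my own, not from any particular tool:

```python
def measured_availability(uptime_hours: float, scheduled_hours: float) -> float:
    """Measured availability: the time the system was actually available
    divided by the time it was supposed to be available, as a percentage."""
    return uptime_hours / scheduled_hours * 100


def predicted_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Predicted availability, as a percentage, from the mean time between
    failures (MTBF) and the mean time to repair (MTTR), both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100
```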
So if the MTBF is 1000 hours and the MTTR is 1 hour, the availability would be 99.9%, often called 3 nines. To give you an idea of how much downtime in a year each number of nines allows, here is a table showing the various levels, or classes, of availability:
| Availability | Total Downtime per Year | Class or # of 9s | Typical application or type of system |
|---|---|---|---|
| 90% | ~36 days | 1 | |
| 99% | ~4 days | 2 | LANs |
| 99.9% | ~9 hours | 3 | Commodity servers |
| 99.99% | ~1 hour | 4 | Clustered systems |
| 99.999% | ~5 minutes | 5 | Telephone carrier servers |
| 99.9999% | ~1/2 minute | 6 | Telephone switches |
| 99.99999% | ~3 seconds | 7 | In-flight aircraft computers |
As you can see, the amount of allowed downtime gets very small as the class of availability goes up. Note, though, that these times assume the system must be available 24x365, which isn't always the case.
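To see where the downtime numbers in the table come from, here is a small sketch of my own, assuming the 24x365 operating period mentioned above, that converts an availability percentage into the downtime allowed per year:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the system must be up 24x365


def downtime_per_year_hours(availability_percent: float) -> float:
    """Hours of downtime per year allowed at the given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Reproduce the downtime column of the table above, one class of nines per line.
for nines, pct in enumerate(
    [90, 99, 99.9, 99.99, 99.999, 99.9999, 99.99999], start=1
):
    print(f"{pct}% ({nines} nine(s)): {downtime_per_year_hours(pct):.4f} hours/year")
```

Running it gives roughly 876 hours per year at one nine down to about 3 seconds at seven nines, matching the table.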
More about high availability in my next entry.