Availability: How many 9s are enough?

By: Bob Landstrom

In this post we’ll discuss the term “availability” as it applies to the data centre world, and give some perspective on the notion of the number of “nines.”

Availability vs. Reliability

Let’s first talk about this term, “availability,” and how it is different from the better understood term, “reliability.”

Availability is a probability (though a conditional one…) that the system delivers the required service at the time it’s called upon to do so.  Availability considers that when the system fails, it can be repaired and restored for service.  That is, it includes the time to repair, and this is a critical aspect of availability.  Availability is often expressed in terms of some number of nines.  “Three 9’s,” for example, is the same as saying “99.9% availability.”

This is different from “reliability,” which is simply the probability that the system will perform its function for a given period of time, under certain conditions.  Reliability is often expressed in terms of Mean Time Between Failures (MTBF) or failure rate.  Unlike availability, reliability is not a conditional probability, and it does not account for maintenance or repair.
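To make the distinction concrete, here is a minimal Python sketch.  It rests on two textbook simplifications that are assumptions of this example, not something stated above: a constant failure rate (so reliability decays exponentially with time), and the common steady-state approximation of availability as MTBF / (MTBF + MTTR), where MTTR is the mean time to repair.  The function names and the MTBF and MTTR figures are hypothetical.

```python
import math


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of running failure-free for mission_hours.

    Assumes a constant failure rate of 1 / MTBF (exponential model).
    Repair plays no part in this figure.
    """
    return math.exp(-mission_hours / mtbf_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up.

    Uses the common approximation MTBF / (MTBF + MTTR), which treats
    MTBF as the mean uptime between repairs.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: same MTBF, three different repair times.
mtbf = 10_000.0
one_year = 8_760.0
print(f"Reliability over one year: {reliability(one_year, mtbf):.3f}")
for mttr in (1.0, 10.0, 100.0):
    availability = steady_state_availability(mtbf, mttr)
    print(f"MTTR {mttr:>5.0f} h -> availability {availability:.4%}")
```

In this sketch the reliability figure is identical in all three cases, because the equipment fails just as often.  Only the availability changes, and it changes purely because of how quickly service is restored.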

Availability in Real Life

Let’s use an example of 99.9% (three 9’s) availability, to see what availability could mean in more tangible terms. 

99.9% availability means, for example:
  • 44 minutes of unsafe drinking water per month
  • 3 crash-landings per week at Heathrow
  • 3,000 letters lost by the Postal Service every hour
  • 2,000 surgical mistakes in the NHS every week
  • 9,000 incorrect banking debits per hour
  • 36,000 missed heartbeats per year (9 hours)

These are all different scenarios, and all unacceptable, but perhaps surprisingly, all pertain to the same availability value.

Let’s look at this from a slightly different angle.  This table shows the amount of time per year that a system is down (or unavailable) for a given number of nines of availability.  We see here that even a five-nines system has, on average, over five minutes of downtime annually.

% Availability    Amount of time unavailable, annually
99%               88 hours
99.9%             8.8 hours
99.99%            53 minutes
99.999%           5.3 minutes
99.9999%          32 seconds
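The figures in the table follow from simple arithmetic: the unavailable fraction (100% minus the availability) multiplied by the hours in a year.  The sketch below reproduces them, assuming a 365-day year of 8,760 hours; the function names are illustrative only, and the table values are the same results rounded.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


def format_duration(hours: float) -> str:
    """Render a duration in the largest unit that keeps it at 1 or more."""
    if hours >= 1:
        return f"{hours:.1f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.1f} minutes"
    return f"{minutes * 60:.0f} seconds"


for percent in (99, 99.9, 99.99, 99.999, 99.9999):
    downtime = annual_downtime_hours(percent)
    print(f"{percent}% -> {format_duration(downtime)}")
```

Each additional nine divides the annual downtime by ten, which is why the step from four nines to five nines takes a system from under an hour of downtime a year to roughly five minutes.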

It’s important to remember that the true availability performance of a data centre is not based solely upon engineering and certifications, but also requires strong operational processes and discipline.  A skilled and well-trained data centre operations team, mature methods of procedure (MOPs) and standard operating procedures (SOPs), disciplined security processes, and sound engineering combine to ensure superior availability performance.

In today’s always-on digital world, downtime translates into lost revenue.  Hypothetical tier ratings are interesting, but a demonstrated track record of strong availability performance, along with evidence of mature and disciplined data centre operations, is important for minimizing risk to your business.