SLA: How many 9's do I need?

Introduction

I recently had a conversation with a colleague regarding service level agreements and what kind of up-time SLAs we were required to provide (or would recommend) to some of our customers. This is something that comes up more and more, particularly in relation to software delivery on cloud hosting platforms. Azure, Amazon AWS, OpenStack, Rackspace, Google App Engine, and so on all offer ever-increasing levels of up-time around their cloud offerings, and this trickles down to the ISVs who build software on these platforms. So how many 9’s does your organisation’s system need?

Percentage availability

Availability is the ability for your users to access or use the system. If they can’t access it because it’s locked up, or offline, or the underlying hardware has failed, then it is unavailable.

For the uninitiated, measuring availability in 9’s is industry parlance for what percentage of time your application is available. The following table maps out the equivalent allowed downtime described by those numbers.

| Description | Up-time | Downtime per year | Downtime per month |
|---|---|---|---|
| two 9’s | 99% | ~3.65 days | ~7.2 hours |
| three 9’s | 99.9% | ~8.7 hours | ~43 minutes |
| three and a half 9’s | 99.95% | ~4.3 hours | ~21 minutes |
| four 9’s | 99.99% | ~52 minutes | ~4.3 minutes |
| five 9’s | 99.999% | ~5.25 minutes | ~25 seconds |
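
As a rough illustration of where these numbers come from, here is a minimal Python sketch that converts an availability percentage into a downtime budget. The function name `allowed_downtime_seconds` and the choice of a 365-day year and 30-day month are my own assumptions for the example, so the results may differ slightly from the rounded values in the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Return the downtime budget, in seconds, over a period of `period_days`."""
    # The fraction of time the system is allowed to be unavailable.
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


# Print the yearly and monthly budgets for the levels discussed above.
# Assumption: 365-day year, 30-day month.
for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year_hours = allowed_downtime_seconds(pct, 365) / 3600
    per_month_minutes = allowed_downtime_seconds(pct, 30) / 60
    print(f"{pct}%: {per_year_hours:.2f} hours/year, {per_month_minutes:.2f} minutes/month")
```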

Service Level Agreements

The number of 9’s that a company’s or service’s SLA specifies does not necessarily mean that the system will always adhere to, or guarantee, that level of up-time. No doubt there are mission-critical systems out there that need guaranteed, consistent up-time and multiple layers of fail-over/redundancy in case those guarantees are not met. However, more often than not, these numbers are goals to be attained, and customers might be offered a rebate/credit if the availability does not reach those goals.

Take Amazon S3 storage services for example. Their service commitment goal is to maintain a three 9’s level of up-time in each month. In the event that they do not, they offer a customer credit of:
- 10% in the case where they drop below three 9’s
- 25% in the case where they drop below two 9’s
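
The tiered credit described above could be sketched like this. The function name `service_credit_pct` is hypothetical, and the thresholds are only the two mentioned in this post; the provider's actual SLA terms may differ or change over time, so treat this as an illustration rather than a statement of their policy.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Return the credit (as a percentage of the bill) for a month's up-time.

    Thresholds are taken from the two tiers described in the text above.
    """
    if monthly_uptime_pct < 99.0:   # dropped below two 9's
        return 25
    if monthly_uptime_pct < 99.9:   # dropped below three 9's
        return 10
    return 0                        # commitment met, no credit


for uptime in (99.95, 99.5, 98.0):
    print(f"{uptime}% up-time -> {service_credit_pct(uptime)}% credit")
```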

Microsoft Azure has a similar service commitment for their IaaS Virtual Machines. In this case, while they offer a similar credit rebate for dropping below 99.95%, they also add the caveat that you must have a minimum of 2 virtual machines configured in an availability set across different fault domains (areas of their data center infrastructure that ensure resources like power & network are redundantly supplied).

What are your requirements?

Our business is predominantly focused on providing our customers with line of business applications. The large majority of their usage is by end-users between 8 am and 6 pm on business days. As a result, we have a level of flexibility with our customers to co-ordinate releases, planned outages and system maintenance in a way that minimally impacts the user base.

In the past, however, I’ve built and maintained systems that were both financially and time critical. SMS-based revenue generation driven by 30 second TV ad spots, for example, is a very different business use case, requiring a different level of service availability. If your system is offline during the 90 second window from the start of the advert, then you risk having lost that customer.

When identifying your own requirements, you need to think about the following:

  • When do you need your system or application to be available?
  • Do you have different levels of availability requirements depending on time of day, month or year?
    • LOB application that needs to be available 9-5/M-F
    • FinSrv application requiring high availability at end of month but lower availability throughout the rest of the month
    • An e-commerce application requiring 24/7 availability across multiple geographic locations & overlapping time zones
  • What are the implications for your system being unavailable?
    • Are there financial implications?
    • Is the usage/availability time critical/sensitive?
    • Are other systems upstream/downstream dependent upon you and if so, what SLA do they provide?
  • If one component of your system is unavailable, is the entirety of the system unusable?
    • Is component availability mutually exclusive?
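
To make the dependency questions above a little more concrete, here is a small sketch of how availability compounds when every component is required. It assumes the components fail independently and that the system is down whenever any one of them is down; the function name and the example figures are hypothetical.

```python
from math import prod


def serial_availability(component_availabilities: list[float]) -> float:
    """Combined availability when ALL components must be up.

    Assumption: failures are independent, so the probabilities multiply.
    Availabilities are fractions between 0 and 1 (e.g. 0.999 for three 9's).
    """
    return prod(component_availabilities)


# Hypothetical example: a web tier at 99.95% that depends on a database at 99.9%.
combined = serial_availability([0.9995, 0.999])
print(f"Combined availability: {combined * 100:.4f}%")
```

Under these assumptions the combined figure is always lower than the weakest component, which is why the SLAs of your upstream dependencies matter.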

The cost of higher levels of availability

Requiring higher levels of availability (more 9’s) means having a more complex, robust and resilient hardware infrastructure and software system. If your system is complicated, that may mean ensuring that the various constituent components can each independently satisfy the SLA, e.g.

  • Clustering your database in a Master-Master replication setup over multiple servers
  • Load-balancing your web application across multiple virtual machines
  • Redesigning to remove single points of failure in your application architecture, such as in-process session state
  • Externalising certain services to 3rd parties that provide commercial solutions. (Azure Service Bus, Amazon S3 Storage etc…)
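
As a back-of-the-envelope illustration of why redundancy helps, the sketch below estimates the availability of several identical replicas where only one needs to be up. It assumes failures are independent and fail-over is instantaneous, which is optimistic for real systems; the function name is my own.

```python
def redundant_availability(single_availability: float, replicas: int) -> float:
    """Availability of `replicas` identical instances where any one suffices.

    Assumption: instances fail independently and fail-over is perfect.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1 - (1 - single_availability) ** replicas


# Two load-balanced machines that each offer two 9's.
print(f"{redundant_availability(0.99, 2) * 100:.4f}%")
```

In practice shared dependencies (network, power, a bad deployment) mean replicas do not fail independently, so the real figure will usually be lower than this estimate.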

And all these things come with a cost.

John’s E-Commerce Site

John runs an e-commerce website where he sells high value consumer goods. During the year his system generates ~€12m in revenue. Averaged over the course of the year, up-time equates to the following revenue earnings. However, since his business is low volume/high margin, missing even a single sale/transaction could be costly.

  • €1,000,000 per month
  • €33,333.33 per day
  • €1,388.89 per hour
  • €23.15 per minute
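
The breakdown above can be reproduced with a few lines of Python. Note that the per-day figure only works out to €33,333.33 if you assume a 30-day month (dividing by 365 days would give roughly €32,876.71 per day instead), so that assumption is baked into this sketch. The variable names are mine.

```python
ANNUAL_REVENUE_EUR = 12_000_000

# Assumption: 12 equal months of 30 days each, matching the figures in the list above.
per_month = ANNUAL_REVENUE_EUR / 12
per_day = per_month / 30
per_hour = per_day / 24
per_minute = per_hour / 60

for label, value in (
    ("month", per_month),
    ("day", per_day),
    ("hour", per_hour),
    ("minute", per_minute),
):
    print(f"EUR {value:,.2f} per {label}")
```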

John’s application currently only offers two 9’s of availability as it’s implemented on a single VPS and has numerous single points of failure. Planned outages are kept to a minimum but are still required to perform updates, releases and patches.

John is considering attempting to increase his platform’s availability to four 9’s. Should he do it?

Quantifying the value of higher levels of availability

If you take a purely financial view of John’s situation, the difference in the cost of downtime between two 9’s and four 9’s is significant.

| SLA | Outage Window | Formula | Total Cost of Max. Outages |
|---|---|---|---|
| 99% | 3.65 days | 3.65 * €33,333 | €121,665.45 |
| 99.99% | 52 minutes | 52 * €23.15 | €1,203.80 |
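
For anyone who wants to plug in their own numbers, the table's arithmetic can be sketched as follows. The function name `max_outage_cost` is hypothetical, and the inputs are the rounded figures used in the table, so this inherits the same simplifying assumption that revenue is spread evenly across the year.

```python
def max_outage_cost(outage_duration: float, revenue_per_unit: float) -> float:
    """Revenue lost if the full downtime budget is used.

    `outage_duration` and `revenue_per_unit` must use the same time unit
    (e.g. days and EUR per day, or minutes and EUR per minute).
    """
    return outage_duration * revenue_per_unit


cost_two_nines = max_outage_cost(3.65, 33_333)   # 3.65 days at EUR 33,333 per day
cost_four_nines = max_outage_cost(52, 23.15)     # 52 minutes at EUR 23.15 per minute
difference = cost_two_nines - cost_four_nines

print(f"Two 9's:    EUR {cost_two_nines:,.2f}")
print(f"Four 9's:   EUR {cost_four_nines:,.2f}")
print(f"Difference: EUR {difference:,.2f}")
```

The difference, a little over €120K, is the most John could spend per year on the upgrade and still break even on this purely financial view.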

Ultimately, he needs to understand whether this is an accurate estimate of the cost impact and, if it is, whether increasing the up-time of his system would cost him more than €120K year on year. There are numerous other business and technical considerations on both sides of the equation.

  • Revenue estimation year on year may or may not be accurate
  • Revenue generation may not be evenly distributed through the year; if he can maintain high availability through the Black Friday and Christmas shopping seasons, it may alleviate most of his losses.
  • There may be other less tangible impacts on recurring revenue due to bad user experiences of arriving while the site is down etc.
  • Downtime may have a detrimental/negative impact on his brand.

On the other hand, what is the cost of the upgrade?

  • Development costs to upgrade the system.
  • Additional hosting costs to move to a cloud platform or additional 3rd parties
  • On-going support costs to maintain this new system
  • There may be other considerations, for example where adopting a new technology (such as a high-availability cache) would remove the need for a higher SLA on a data store.

Assuming that the system can be upgraded initially and then maintained year on year for less than €120K, the return on investment would justify John undertaking this work. It would be a different conversation the next time, though, when he wants to go to five 9’s availability.
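
To see why the five 9’s conversation would be different, here is a rough sketch using the downtime figures from the availability table (~52 minutes and ~5.25 minutes per year) and John’s ~€23.15 per minute. The function name is hypothetical and the same even-revenue assumption applies.

```python
def avoidable_loss(current_downtime_min: float,
                   target_downtime_min: float,
                   revenue_per_min: float) -> float:
    """Yearly revenue protected by shrinking the downtime budget."""
    return (current_downtime_min - target_downtime_min) * revenue_per_min


# Moving from four 9's (~52 min/year) to five 9's (~5.25 min/year).
saving = avoidable_loss(52, 5.25, 23.15)
print(f"Additional revenue protected: EUR {saving:,.2f} per year")
```

On these numbers the extra 9 protects only around €1,082 a year, so the upgrade would need to be very cheap to pay for itself on revenue alone.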

Thoughts?

Deciding on an appropriate level for your SLA is complicated, and a myriad of considerations and inputs will dictate the “right” answer for your particular situation. Whatever you decide, attempting to achieve higher and higher levels of availability for your system will most probably lead to higher costs and smaller returns on investment. So make sure the level you choose is appropriate from both a business and technical perspective.

~Eoin Campbell
