How to Reduce the Costs of Downtime

Following up on my previous piece on the real cost of downtime, I wanted to take a minute to list three ways to reduce that cost.

1. Write perfect code

This will never happen; I included it as a joke. First, no code is perfect. Second, platform failures will still happen no matter how good your code is. Third, by emphasizing stability while writing code, we often sacrifice development speed to over-optimization, which breaks the agile methodology we all live by. Okay, so this one isn’t going to help.

2. Create more robust systems

You may ask: “Now wait just a minute: you just said it wasn’t possible to write code that never fails, and now you’re asking me to build a more robust system. Isn’t that just another term for the same thing?”

Gentle reader, the two are almost opposites of each other. While an emphasis on perfect code asks that we make everything work perfectly the first time, a robust system is one that expects failure.

What robustness looks like on your team is going to vary greatly. Massive organizations with big budgets and an imperative to outperform all others can do things like Chaos Engineering to simulate platform failures, bugs, and other hard-to-predict failure states.
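
For a sense of what fault injection looks like at toy scale, here’s a purely illustrative, in-process sketch. The decorator, the CHAOS_ENABLED switch, and the fetch_inventory function are all invented for the example; real chaos tooling works against your infrastructure rather than a single function.

```python
# A toy, illustrative take on chaos-style fault injection; real chaos tooling
# operates at the infrastructure level, not inside one process.
import functools
import os
import random

CHAOS_ENABLED = os.environ.get("CHAOS_ENABLED") == "1"  # never on by default

def chaos(failure_rate: float):
    """Randomly fail a call to check whether the caller copes with the error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if CHAOS_ENABLED and random.random() < failure_rate:
                raise ConnectionError(f"chaos: injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.1)
def fetch_inventory(item_id: str) -> int:
    return 42  # stand-in for a real downstream call

# Callers should treat an injected failure like any other downstream error.
try:
    print(fetch_inventory("sku-123"))
except ConnectionError as err:
    print(f"fell back to cached inventory after: {err}")
```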

But even the smallest team can use feature flagging, multi-availability-zone platforms, and of course, Observability to improve how their system handles failure.

This isn’t my original thought; check out Martin Fowler’s writing on observability and feature flagging.
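
To make the feature-flagging part concrete, here’s a minimal sketch of a flag gating a risky code path so it can be switched off during an incident without a deploy. The flag name and the checkout functions are invented for the example; in practice you’d back the lookup with a flag service or config store rather than an in-memory dict.

```python
# Minimal, hypothetical feature-flag gate; a real team would use a flag
# service or config store instead of an in-memory dict.
FLAGS = {
    "new_checkout_flow": False,  # risky new code path, off by default
}

def is_enabled(flag_name: str) -> bool:
    """Look up a flag; unknown flags default to off, which fails safe."""
    return FLAGS.get(flag_name, False)

def legacy_checkout(cart):
    return f"processed {len(cart)} items with the legacy flow"

def new_checkout(cart):
    return f"processed {len(cart)} items with the new flow"

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)   # the code path under test
    return legacy_checkout(cart)    # the known-good fallback

if __name__ == "__main__":
    print(checkout(["book", "mug"]))  # uses the legacy flow until the flag is flipped
```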

The value of finding blind spots

Observability is a major component here: with a well-observed system, you can identify problems before they cause outages. Even better: when setting up observability components like Distributed Tracing, you’ll often identify blind spots. Things your team doesn’t fully understand are revealed the moment you try to trace a normal request. If you resolve those blind spots before you have an outage, the time it takes to identify the root cause during an outage will drop dramatically.
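
The good news is that basic tracing doesn’t take much code. The sketch below uses the OpenTelemetry Python API with a console exporter purely so it runs on its own; the span names and the handle_request function are invented for the example, and a real setup would export spans to your tracing backend instead.

```python
# Requires the opentelemetry-api and opentelemetry-sdk packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to the console so the example is self-contained; swap in your
# tracing backend's exporter in a real deployment.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

tracer = trace.get_tracer(__name__)

def handle_request(user_id: str) -> None:
    # One span per logical step; steps you can't name are exactly the blind spots.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("load_profile"):
            pass  # call the profile service here
        with tracer.start_as_current_span("render_response"):
            pass  # build the response here

handle_request("user-123")
```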

3. Shorten the length of downtime

Here, in order of best to worst, are the top five ways to find out about downtime:

  1. Noticed by observability tools, alert sent to Ops 
  2. “Hey guys, are we down?” sent on the company Slack
  3. Sales or Support notices that something seems wrong and informs the product/ops team
  4. Users report a problem to Support
  5. Sales experiences downtime during a demo

Scenarios 2 and 3 only look similar if you’ve never worked in Sales or Support: to users, an outage often looks like something else entirely, so if an internal team identifies the problem before users are aware of it, all that damage to brand reputation can be prevented.

These scenarios are also progressively worse because of how little information we have about the problem as we descend the list. When an observability tool finds a problem, we’re often very close to the root cause. But a text message from a sales rep saying, “site’s down?!?” means we have to start from square one.

All these factors contribute to how long your downtime lasts. This matters because, out of all the ways you can find out about downtime, only tools-based observability will let you resolve downtime without incurring significant costs.
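
Getting to scenario 1 on the list above doesn’t require a huge observability budget, either. Here’s a deliberately simplified uptime poller as a sketch; the health-check URL, the thresholds, and the page_ops function are placeholders for whatever endpoint and paging tool you actually use.

```python
# Deliberately simplified uptime check; the URL and alert hook are placeholders.
import time
import urllib.request

CHECK_URL = "https://example.com/healthz"   # hypothetical health endpoint
CHECK_INTERVAL_SECONDS = 30
FAILURES_BEFORE_ALERT = 3                   # avoid paging on a single blip

def page_ops(message: str) -> None:
    # Stand-in for a PagerDuty/Opsgenie/Slack webhook integration.
    print(f"ALERT: {message}")

def site_is_up() -> bool:
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=5) as response:
            return response.status == 200
    except Exception:
        return False

def main() -> None:
    consecutive_failures = 0
    while True:
        if site_is_up():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_ALERT:
                page_ops(f"{CHECK_URL} failed {consecutive_failures} checks in a row")
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```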

It’s possible that internal teams might notice downtime within a few minutes. Even so, they’ll be starting without much knowledge about the nature of the problem. Generously, we can assume that the investigation will take a quarter of an hour. By Atlassian’s estimate, that investigation stage alone will cost tens of thousands of dollars.

If you need your users to notice the problem and for one of them to be annoyed enough to contact you, you’ll be waiting at least 20 minutes. Unless you’re working out of a garage, their complaint won’t reach an engineer directly, meaning another 5-10 minutes to replicate the problem and inform Ops. Going off our standard of thousands of dollars per minute, an outage reported by users has a base price of one hundred thousand dollars.
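
To put rough numbers on the comparison, here’s a back-of-the-envelope sketch. The per-minute figure is an assumption chosen purely for illustration, not a number from Atlassian or my previous piece; swap in whatever a minute of downtime actually costs your business.

```python
# Back-of-the-envelope only; COST_PER_MINUTE is an assumed, illustrative figure.
COST_PER_MINUTE = 3_500  # dollars per minute of downtime

def outage_cost(minutes_to_detect: int, minutes_to_escalate: int, minutes_to_investigate: int) -> int:
    """Rough cost of an outage up to the point the root cause is identified."""
    return (minutes_to_detect + minutes_to_escalate + minutes_to_investigate) * COST_PER_MINUTE

# Caught by an observability tool: the alert fires almost immediately.
print(outage_cost(1, 2, 5))    # $28,000

# Reported by users: ~20 minutes to hear about it, 5-10 more to replicate and escalate.
print(outage_cost(20, 8, 0))   # $98,000 -- the "base price" before investigation even starts

# Add the quarter-hour investigation on top of that.
print(outage_cost(20, 8, 15))  # $150,500
```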

The solutions to downtime

Beyond improving how downtime is detected, some of the measures mentioned above also help your whole team understand your service better. Simple practices like knowledge sharing, architectural overviews, and team brown-bag lectures can vastly shorten your time to resolve an issue.

Essentially, anything that improves your technical team’s bus factor will also shorten the length of outages. If one person leaving would cause problems for your team, you’ll experience a mini version of that same issue during every outage: unless your best experts are always on call, whoever first detects the problem will be greatly helped by having some understanding of the systems in play.