How to Reduce the Costs of Downtime

October 27, 2022

Following up on my previous piece on the real cost of downtime, I wanted to take a minute to list three ways to reduce the cost of downtime.

1. Write perfect code

This will never happen, and I put it in as a joke. First, no code is perfect. Second, platform failures will still happen. Third, by emphasizing stability while writing code, we often sacrifice development speed through over-optimization. This breaks the agile methodology that we all live by. Okay, so this one isn’t going to help.

2. Create more robust systems

You may ask: “Now wait just a minute, you just said that it wasn’t possible to write code that never failed, and now you’re asking me to write a more robust system. Isn’t that just another term for the same thing?”

Gentle reader, the two are almost the opposite of each other. While an emphasis on perfect code asks that we make everything work perfectly the first time, a robust system is one that expects failure.

What robustness looks like in your team is going to vary greatly. Massive organizations with big budgets and an imperative to outperform all others can do stuff like Chaos Engineering to simulate platform failures, bugs, and other impossible-to-predict failure states.

But even the smallest team can use feature flagging, multi-availability-zone platforms, and of course, Observability to improve how their system handles failure.

This isn’t my original thought; check out Martin Fowler’s writing on observability and feature flagging.
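To make "expects failure" concrete, here is a minimal sketch of a feature flag used as a kill switch. Everything in it is an assumption for illustration: the `FEATURE_<NAME>` environment variable convention, the function names, and the recommendation example are all hypothetical, and a real team would more likely use a flagging service than raw environment variables.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a flag from the environment (hypothetical FEATURE_<NAME> convention)."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def legacy_recommendations(user_id: int) -> list[str]:
    # Known-good path that has been running in production.
    return ["bestsellers"]


def new_recommendations(user_id: int) -> list[str]:
    # Stand-in for a newer, riskier code path; here it always fails.
    raise RuntimeError("recommendation service unavailable")


def recommendations(user_id: int) -> list[str]:
    """Serve the new path only when the flag is on, and fall back if it fails."""
    if flag_enabled("new_recommendations"):
        try:
            return new_recommendations(user_id)
        except Exception:
            # Expect failure: degrade to the old behaviour instead of erroring out.
            return legacy_recommendations(user_id)
    return legacy_recommendations(user_id)
```

With the flag off, the risky path is never touched; with it on, a failure in the new code degrades to the old behaviour rather than becoming an outage. Turning the flag off is then a configuration change, not a deploy.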

The value of finding blind spots

Observability is a major component here: with a well-observed system, you can identify problems before they cause outages. Even better: when trying to set up observability components like Distributed Tracing, you’ll often identify blind spots. Things your team doesn’t understand fully will be revealed when you want to trace a normal request. If you resolve those blind spots before you have an outage, the time it takes to identify the root cause during an outage will drop dramatically.
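As a rough illustration of what tracing a normal request involves, here is a toy tracer built only on the Python standard library. It is not a real tracing library and not how any particular product works; the `span` helper, the in-memory `FINISHED_SPANS` list, and the checkout example are all hypothetical names made up for this sketch.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Optional

# Finished spans are collected in memory; a real tracer would export them.
FINISHED_SPANS = []
_current_span: ContextVar = ContextVar("current_span", default=None)


@contextmanager
def span(name: str, trace_id: Optional[str] = None):
    """Record one timed step of a request, linked to its parent span if any."""
    parent = _current_span.get()
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent["span_id"] if parent else None,
        "trace_id": trace_id or (parent["trace_id"] if parent else uuid.uuid4().hex),
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        FINISHED_SPANS.append(record)


def handle_checkout():
    # Each step the team knows about gets its own child span.
    with span("POST /checkout"):
        with span("load_cart"):
            pass
        with span("charge_card"):
            pass
```

Calling `handle_checkout()` leaves three spans in `FINISHED_SPANS` that share one trace ID, with the two inner steps pointing at the request span as their parent. The blind spots show up as the gaps: any time in the request span that no child span accounts for is work nobody has instrumented, and often work nobody fully understands.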

3. Shorten the length of downtime

Here, in order of best to worst, are the top five ways to find out about downtime:

1. Noticed by observability tools, alert sent to Ops
2. “Hey guys, are we down?” sent on the company Slack
3. Sales or Support notices that something seems wrong and informs the product/ops team
4. Users report a problem to Support
5. Sales experiences downtime during a demo

Scenarios 3 and 4 look similar only if you’ve never worked in Sales or Support: an outage often looks like something else to the users, so if an internal team identifies the problem before users are aware, all that damage to brand reputation can be prevented.

These scenarios are also progressively worse because of how little information we have about the problem as we descend. When an observability tool finds a problem, we’re often very close to the root cause. But a text message from a sales rep saying, ‘site’s down?!?’ means we have to start looking from square one.

All these factors contribute to the length of time that your downtime lasts. This matters because, out of all the ways you can find out about downtime, only tools-based observability will let you resolve downtime in a way that doesn’t incur significant costs.
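To show why a tool-generated alert starts you closer to the root cause than a "site's down?!?" message, here is a small sketch of an error-rate check. The `Request` type, the 5% threshold, and the message format are assumptions made up for this example, not the behaviour of any specific monitoring product.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    endpoint: str
    status: int


def check_error_rate(requests: list[Request], threshold: float = 0.05) -> Optional[str]:
    """Return an alert message if the 5xx rate exceeds the threshold, else None."""
    if not requests:
        return None
    errors = [r for r in requests if r.status >= 500]
    rate = len(errors) / len(requests)
    if rate <= threshold:
        return None
    # Include the endpoint producing the most errors so Ops has a starting point.
    worst_endpoint, count = Counter(r.endpoint for r in errors).most_common(1)[0]
    return (
        f"5xx rate {rate:.0%} over last {len(requests)} requests; "
        f"{count} of {len(errors)} errors from {worst_endpoint}"
    )
```

Run against a window of recent requests, the check stays silent on a healthy system and otherwise names the endpoint producing the most errors. Whoever picks up the alert begins with a location to look at rather than square one.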

It’s possible that internal teams might notice downtime within a few minutes. Even so, they’ll be starting without much knowledge about the nature of the problem. Generously, we can assume that investigation will take a quarter of an hour. By Atlassian’s estimate, that investigation stage will cost tens of thousands of dollars.

If you need your users to notice the problem and for one of them to be annoyed enough by the problem to contact you, you’ll be waiting at least 20 minutes. Unless you’re working out of a garage, their complaint won’t reach an engineer directly, meaning 5-10 minutes to replicate the problem and inform Ops. Going off our standard of thousands of dollars per minute, an outage reported by users has a base price of one hundred thousand dollars.
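The arithmetic behind these figures can be sketched as follows. The durations for the user-reported path come from the paragraphs above (20 minutes to notice, the low end of 5-10 minutes to hand off, a quarter of an hour to investigate). The 3 minutes for "a few minutes" and the $2,500 per-minute rate are my own illustrative placeholders, picked only because they fall inside "a few minutes" and "thousands of dollars per minute"; substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass
class DetectionPath:
    name: str
    minutes_to_notice: float    # time until anyone knows something is wrong
    minutes_to_handoff: float   # time for the report to reach an engineer
    minutes_to_diagnose: float  # investigation before a fix can start

    def total_minutes(self) -> float:
        return self.minutes_to_notice + self.minutes_to_handoff + self.minutes_to_diagnose

    def cost(self, cost_per_minute: float) -> float:
        return self.total_minutes() * cost_per_minute


# Assumption: placeholder rate, not a measured figure.
COST_PER_MINUTE = 2_500

# Assumption: "a few minutes" taken as 3, with no handoff delay.
internal = DetectionPath("noticed internally", 3, 0, 15)
# 20 minutes to notice, 5 minutes to reach Ops, 15 minutes to investigate.
user_reported = DetectionPath("reported by users", 20, 5, 15)

for path in (internal, user_reported):
    print(f"{path.name}: {path.total_minutes():.0f} min, ${path.cost(COST_PER_MINUTE):,.0f}")
```

Under these assumptions the user-reported path runs 40 minutes before a fix can even begin, which at the placeholder rate is $100,000. The exact rate matters less than the shape: every minute spent waiting for someone to notice is billed at the same rate as a minute spent fixing.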

The solutions to downtime

Beyond improving how downtime is detected, some of the measures mentioned above can help with how well your whole team understands your service. Simple measures like knowledge sharing, architectural overviews, and team brown bag lectures can vastly improve your time to resolve an issue.

Essentially, anything that improves your technical team’s bus factor will also shorten the length of outages. If one person leaving will cause problems for your team, you’ll experience a mini version of that same issue every outage: unless your best experts are always on call, whoever first detects the problem will be greatly helped if they have some understanding of the systems in play.
