The Real Cost of Downtime
Recent economic shifts have affected many in the tech sector. While some of us are seeing significant cutbacks with hiring freezes and small layoffs, others are continuing business as usual. Still, this zeitgeist has all of us looking hard at expenses that don't feel strictly necessary.
Harvard Business Review said in a recent piece that in times of recession, consumers sort all expenses into four categories: essentials, treats, postponables, and expendables.
When talking about software and services, we don't have treats (beyond a subscription to heytaco), but we certainly separate our subscriptions into "Essentials," "Postponables," and "Expendables."
The question is, what's expendable? I don't think most of us consider operations tools expendable. Anything that ensures reliability certainly isn't "Expendable." But might it not be "Postponable"? Shouldn't we focus on new features and worry about 'tech debt' when we have more resources to spend?
Perhaps our team is in a hiring freeze. If that's the case, why bother investigating failures and bugs if we don't have the staff to fix the problems anyway?
"Tech debt" isn't something that people like to accumulate. Still, an intelligent engineering manager knows it's inevitable. During a recession, it can be very tempting to move a lot of monitoring, testing, and observability down to that third category: Postponables.
Like a driver deciding they can drive with less tread on their tires, we can get through hard times by moving maintenance just a bit out of schedule. In many instances, this is true! A downturn with complex revenue problems isn't the right time to re-engineer your platform if it's working okay. And if you skip driving in the rain, your tires may go another thousand miles.
This article talks about what a catastrophic failure can cost.
Let’s talk about the real cost of downtime.
In a post last year from Atlassian, the cost of downtime was roughly estimated at thousands of dollars per minute, with an hour of downtime costing half a million dollars. Several factors contribute to that cost:
- Lost productivity - time spent fixing the problem
- Lost revenue - customers citing downtime as a reason for churn
- Brand reputation damage
- Data loss - when recovering from downtime, there's almost always some data that is lost or needs repair
With these factors, even a small or medium business can lose massive amounts of money with prolonged downtime.
It's critical to consider that the length of downtime matters: the longer the system is down, the more time there is for data to get out of sync, for key accounts to notice the problem, and for the root problem to conceal itself under cascading failures.
We often ask about 'the cost of a single outage,' but that framing doesn't capture how much worse a prolonged outage is than a very short one.
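To make the duration point concrete, here is a minimal Python sketch. The $500,000-per-hour rate is the rough estimate quoted above; the `escalating_cost` model and its `growth_per_hour` parameter are purely illustrative assumptions, not figures from any study. It compares one hour-long outage against ten six-minute outages that add up to the same total downtime.

```python
# Rough figure quoted in the article: about half a million dollars per hour.
COST_PER_HOUR = 500_000
COST_PER_MINUTE = COST_PER_HOUR / 60  # roughly $8,333 per minute


def linear_cost(minutes: float) -> float:
    """Cost if every minute of downtime costs the same amount."""
    return COST_PER_MINUTE * minutes


def escalating_cost(minutes: float, growth_per_hour: float = 0.5) -> float:
    """Hypothetical model: the per-minute cost rises as the outage drags on.

    growth_per_hour is an invented, illustrative parameter. A value of 0.5
    means the per-minute rate is 50% higher one hour in than at the start.
    The total is the integral of
        COST_PER_MINUTE * (1 + growth_per_hour * t / 60)
    for t from 0 to `minutes`.
    """
    return COST_PER_MINUTE * (minutes + growth_per_hour * minutes**2 / 120)


one_long = escalating_cost(60)        # a single 60-minute outage
ten_short = 10 * escalating_cost(6)   # ten 6-minute outages, same total time

print(f"Flat rate, 60 minutes total:     ${linear_cost(60):,.0f}")
print(f"Escalating, one 60-min outage:   ${one_long:,.0f}")
print(f"Escalating, ten 6-min outages:   ${ten_short:,.0f}")
```

Under a flat rate the two scenarios cost the same. Once the per-minute cost grows with outage length, the single long outage becomes the more expensive one. The real shape of that curve depends on your business, so treat these numbers as placeholders.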
Downtime has more than immediate costs.
What's the cost of stress? I know that none of us loves getting a late-night ping from our monitoring tools, but the pressure of a ping from the head of sales saying 'logins are failing' with no other details is a *lot* more stressful.
The figure cited above represents the cost of downtime in terms of how it affects income. But several other costs may take more than a quarter to show up, and they can still hurt your bottom line drastically.
What’s your best engineer worth?
One of the terrible things about downtime is that it asks the most from those with the least to give. Extremely busy Enterprise reps will need to handle communications, and your best engineers will need to look into the problem.
No one would argue that downtime should be left to the interns to fix, but it can be maddening when repeated downtime means your best engineers are constantly being woken up in the middle of the night to try and find the source of a new problem.
Once our best Operations people are done getting the service back up, it will also take our best data people to clean up whatever irregularities were caused by the outage (remember that 'data loss' line item in the list above?).
So the following day our best engineers will be tired and stressed. Hopefully, they'll take the morning off to recharge, leaving them with less time to debrief with their teams. Repairing the immediate cause of the problem leaves less time to train everyone in how the whole system works. This can lead to a negative feedback loop where most of the team only understands small parts of the system (enough to add features), and only a select few have some grasp of the whole system (enough to fix failures).
Sadly, this whole setup feeds on itself: more time spent putting out fires leaves your best people isolated from the bulk of the product and operations team. And that gets us to a problem no engineering manager wants to deal with.
When a team has to make the sad choice to reduce costs, almost all leaders will do that through layoffs. Often an upset team will ask, 'Why not just offer us all a small pay cut to reduce costs by the same amount?' The problem with this is the dreaded negative selection.
> Note: in the following paragraph, I talk about the reality of layoffs and how people are selected for them. If you've been laid off personally, please remember that your value to a particular company is not your value as a person.
Every employee is worth something to a company. With layoffs, leadership tries to select the people least necessary to the company mission, the least valuable, to remove. If everyone gets a pay cut instead, the most and least valued employees are affected alike. The problem is that some people will leave the company when faced with a pay cut. And further, the leavers are much more likely to be those who can most easily get a job elsewhere. That group is likely to include your most valuable employees. In extreme cases, negative selection means only your worst performers stay.
Negative selection is at play again when we are a technical team plagued by outages. With late-night calls, difficult root cause analysis, and a constant state of crisis, downtime creates a condition of negative selection within your team. People who know they could leave for a better team with less stress will go, and those who don't have other good prospects will stay. Every spate of downtime weakens your team, and only a team that proactively handles problems before crises can hope to hold on to the best engineers.