Concrete Steps to Reduce MTTR

In today’s data-centric world, performance is defined by metrics. The time between when an incident starts and when it is resolved shows how well a system can handle such events, and MTTR is one of the most common ways to measure it. MTTR usually stands for Mean Time To Resolution, though the abbreviation has held several meanings over the years. It measures how well a system bounces back from errors and how quickly teams deliver lasting fixes. In this guide, we will look at MTTR in detail and the best practices you can follow to keep it to a minimum.

What is MTTR?

As mentioned previously, MTTR stands for Mean Time To Resolution in modern contexts. It is one of the most widely used metrics for ensuring the reliability of a system. The MTTR metric defines how well a system can recover from an average error encountered during normal functioning.

On the other hand, MTTR is also among the most misunderstood metrics in the APM industry. Most developers and teams lack a clear understanding of the concept and how to use it best. More often than not, this results in frequent disconnects and non-uniform MTTR coverage around the clock. This directly drives up business costs and increases the complexity and risk faced in the software development process.

A high MTTR number means there are multiple hindrances in your standard issue-resolution process, and you may need to rework your recovery strategy. At the same time, several other factors, such as incident frequency, traffic at the time of the incident, and the number of concurrent issues, can influence an individual MTTR reading. It is therefore essential to take the number with a grain of salt and pair it with other details, such as affected users, traffic, and overall app usage, to understand its real impact on your application.

Over the years, the abbreviation MTTR has held multiple meanings, from measuring how quickly a company can get failed equipment working again to measuring how long it takes to fix the root cause of an error so it never happens again. Here are the most common meanings of MTTR and what each one means for an organization’s business:

Mean Time to Resolution

Mean Time to Resolution is the most common understanding of MTTR today. It refers to the time between identifying an issue and implementing a fix that prevents similar errors from happening again. The focus here is on the bigger picture: implementing fixes that stop future recurrences of the error. This may require extra time to find the right solution and extensive testing before the issue is flagged as resolved.

Mean Time to Resolution is considered a versatile metric because it pushes organizations to solve the root of a problem rather than just counter its immediate effects. This is why it is regarded as the most powerful interpretation of MTTR today.

Mean Time to Repair

Mean Time to Repair is the age-old definition of MTTR used in traditional engineering environments. It applies to a repairable piece of equipment or a faulty component inside one, such as a disk drive, motherboard chipset, or heatsink. At least one technician or trained person is involved in the repair, and the goal is to get the equipment back to a properly functioning state.

This metric does not strictly account for any measures technicians may take to ensure that a similar failure does not happen again. Future robustness is often an inherent part of the repair, but it is not measured independently.

Mean Time to Recovery

Mean Time to Recovery is, in short, a digitized version of Mean Time to Repair. Mean Time to Recovery refers to digital “objects”, which largely consist of computing-related software and hardware.

Even though Mean Time to Recovery is a spin-off of the old Mean Time to Repair, the process of restoring a digital system differs greatly from that of traditional engineering equipment. While a malfunctioning physical device usually has to be repaired as it stands, a computing environment lets you employ smarter methods to save time and money.

Since most digital resources are intangible, what matters to a user is their availability. Say you provide an application to your users over the internet; the users only care that your application’s URL points to the right place and that the application works well. The server that delivers the software to the user’s browser, or the hardware used to store and support the application, is not the customer’s concern. This is where intelligent management comes in. Instead of repairing downed server hardware on-site, operations teams can redirect traffic to an identical backup server. This gives the team ample time to identify and fix the issue while maintaining your application’s uptime for end-users.
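To make the idea concrete, here is a minimal sketch in Python of the health-check-and-failover pattern described above. The endpoints are hypothetical, and real setups usually handle this at the load-balancer or DNS layer rather than in application code:

```python
import urllib.request

# Hypothetical primary and standby endpoints (illustrative URLs only).
SERVERS = ["https://primary.example.com/health", "https://standby.example.com/health"]

def healthy(url, timeout=2):
    """Return True if the server answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_backend():
    """Route traffic to the first healthy server so users never notice the outage."""
    for url in SERVERS:
        if healthy(url):
            return url
    raise RuntimeError("all backends are down - page the on-call team")
```

While traffic flows to the standby server, the team can repair the primary at its own pace, which is exactly what keeps the recovery time low from the user's point of view.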

The Modern MTTR

As you can see, MTTR is used in more than one context, but it is essential to stay in sync with the version that the modern technology industry uses. The contemporary understanding of MTTR is Mean Time to Resolution, and it is the preferred definition.

How to Calculate MTTR?

Having covered the many meanings of the term, it is time to see how to calculate it. For simplicity, we will refer to Mean Time to Resolution as MTTR for the rest of this article.

MTTR is a mean, or average, value. The final figure is therefore an average calculated from the data of a large number of incidents.

If a vendor says that the Mean Time to Resolution for the services they offer is 4 hours, it does not imply that every issue within their systems will be fixed in under 4 hours. This only means that, on average, it takes them 4 hours to fix an issue. In isolated, borderline incidents, they might even take as long as 6 to 8 hours to fix the problem. This can happen due to several factors such as high risk, high traffic, uniqueness of the issue, etc. On the flip side, they might even resolve common, smaller issues in less than 2 hours. It is important to be careful with this metric and ensure that you are not misinterpreting anything.

With the “mean” concept understood, the calculation is fairly simple. To calculate the MTTR of your operations team or process, add up the total number of hours spent on resolving issues and implementing fixes across multiple incidents, then divide the sum by the number of incidents. You can calculate this for any period by considering only the incidents that occurred during that period.

Another important fact to remember about this metric is not to use it for planned incidents, such as service requests or maintenance downtime. MTTR is primarily used to gauge unplanned issue resolution.
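To make the arithmetic concrete, here is a minimal sketch in Python with made-up incident timestamps. It excludes planned work, as described above, and averages the resolution time of the remaining incidents:

```python
from datetime import datetime, timedelta

# Made-up incident records for one month: (started, resolved, planned?)
incidents = [
    (datetime(2023, 1, 3, 9, 0),   datetime(2023, 1, 3, 13, 30),  False),
    (datetime(2023, 1, 9, 22, 15), datetime(2023, 1, 10, 1, 15),  False),
    (datetime(2023, 1, 14, 2, 0),  datetime(2023, 1, 14, 6, 0),   True),   # planned maintenance
    (datetime(2023, 1, 21, 11, 0), datetime(2023, 1, 21, 12, 45), False),
]

# MTTR only covers unplanned incidents, so filter out planned work first.
unplanned = [(start, end) for start, end, planned in incidents if not planned]

# MTTR = total time spent resolving incidents / number of incidents.
total_time = sum((end - start for start, end in unplanned), timedelta())
mttr = total_time / len(unplanned)

print(f"MTTR over {len(unplanned)} unplanned incidents: {mttr}")  # 3:05:00
```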

The Business Case for Shortening MTTR

The MTTR metric holds a key to a business’s success because the MTTR of an incident response team directly dictates how well an organization can serve its digital products to customers. A high MTTR number indicates that unplanned incidents may take longer than usual to resolve, leaving a sense of uncertainty about the application’s uptime.

MTTR is also closely tied to the customer experience (CX) side of a business. Your MTTR number reflects questions such as: How long do customers have to wait when something breaks? How dependable does your service feel to them?

The answers to these questions can directly affect the revenue a business generates. Therefore, it is important to keep the MTTR duration as short as possible, in addition to reducing the number of issues across your digital products.

Concrete Ways to Reduce Mean Time to Resolution

Having discussed the meaning and impact of MTTR on a business’s growth, we can now understand the various ways in which an organization can reduce its mean time to resolution. These methods encompass the four key steps in fixing issues: detect, diagnose, fix, and recover. Let’s take a look at these methods in detail:

Actively Monitor to Reduce Detection Time

The first and foremost step to resolving issues fast is to identify them fast. Without a proper monitoring and alert setup, reducing MTTR is nearly impossible. If you don’t receive updates about an issue arising in your application on time, you will lose business due to frequent downtimes and slow maintenance. Here are some ways you can handle this step better:

Monitor Extensively

The most obvious way to identify issues faster is to monitor your application’s performance better. You cannot fix something until you know it is broken, and even knowing that it broke is not enough. Modern systems are huge, and merely knowing that something is off will probably cost you hours in pinpointing the bottleneck. This is why it is important to have an intelligent monitoring solution in place that identifies issues as soon as they arise and provides enough metadata to locate their source quickly.

Scout APM is the perfect solution to your monitoring woes. Scout helps you quickly identify, prioritize, and resolve performance problems such as memory bloat, N+1 queries, slow database calls, and much more. Scout APM is one of the most reasonably priced, highly targeted tools in the application performance monitoring domain. At the moment, Scout APM supports applications written in Ruby, PHP, Python, Node.js, and Elixir.

Reduce Alert-Noise and Calibrate Your Tools

Once you have a monitoring plan in place, it is important to maintain clear control over it. While most APM tools and providers ship with sensible defaults, you should still take control of your monitoring and alert preferences. This ensures that you neither lose important alerts to an over-aggressive filtering scheme nor get bombarded with alerts throughout the day because of an overly lenient alert plan. Finding the right balance for your use case is essential.

Another measure to implement is setting appropriate thresholds for the service level indicators (SLIs) in your monitoring tools. These thresholds flag when a certain SLI is about to move outside its typical range. This setup can help you predict issues and overloads before they happen and fix them before real users ever encounter them.
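As a rough illustration, here is what such thresholds might look like in code. The indicator names and numbers below are invented for the example; in practice they would come from your own SLIs and your monitoring tool’s alerting configuration:

```python
# Hypothetical SLI thresholds: warn before an indicator leaves its normal range.
SLI_THRESHOLDS = {
    "p95_latency_ms": {"warning": 400, "critical": 800},
    "error_rate_pct": {"warning": 1.0, "critical": 5.0},
}

def evaluate_sli(name: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a measured SLI value."""
    levels = SLI_THRESHOLDS[name]
    if value >= levels["critical"]:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return "ok"

print(evaluate_sli("p95_latency_ms", 520))  # "warning" -> investigate before users notice
```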

Leverage AIOps Technologies to Identify and Resolve Incidents Faster

A recent buzz in modern performance monitoring is the use of AIOps, a combination of Artificial Intelligence and routine IT Operations, which helps teams prevent incidents and respond to them faster. AIOps uses machine learning to produce an intelligent stream of incident-related information alongside traditional telemetry data. This additional stream of information helps on-call teams prepare for oncoming issues ahead of time.

AIOps helps an incident response team in several ways. It provides timely detection of anomalies; in some cases, such early detection can solve issues before users even notice them. AIOps also separates real problems from false alarms. Beyond identifying the right issue, it can identify the right person to solve the problem, and in some cases it can even suggest remedies for common issues, reducing the response time further.
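As a toy illustration of the anomaly-detection idea behind such tooling (this is not any specific AIOps product), here is a simple z-score check in Python on made-up latency readings:

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a reading that deviates strongly from recent history (z-score check)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1e-9  # guard against zero variance
    return abs((latest - mean) / stdev) >= z_threshold

# Made-up response-time samples (ms); the last reading is a spike worth alerting on.
recent_latencies = [120, 118, 125, 130, 122, 119, 127, 124]
print(is_anomalous(recent_latencies, 410))  # True -> open an incident early
```

Production AIOps platforms go far beyond this, correlating many signals and suppressing false alarms, but the principle of flagging deviations from normal behavior is the same.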

If appropriately implemented, AIOps is a game-changing technology that can automate and speed up your IT Operations by a large magnitude.

Prioritize High-Impact Incidents

When you face a single active issue in your application, the approach appears quite simple. All you need to do is find a fix for it and implement that fix as soon as possible. But in real-life scenarios, problems don’t occur one at a time. In most cases, more than one issue arises that requires attention.

In these moments, it is important to prioritize before acting. While you may want to resolve issues in the order they occur, that is often not the best response. Incidents critical to your application’s functioning need to be handled first. Next, sort incidents by the number of users and engagements they affect and by the tentative time to fix them. These small decisions before diving into the main job lead to a better response to the ongoing issues and help restore normalcy in a properly planned manner, as sketched below.
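Here is a minimal sketch of that kind of triage in Python; the fields and ordering rules are illustrative assumptions, not a prescribed scheme:

```python
# Hypothetical open incidents with the attributes discussed above.
incidents = [
    {"id": "INC-101", "critical_path": False, "users_affected": 1200, "est_fix_hours": 2},
    {"id": "INC-102", "critical_path": True,  "users_affected": 300,  "est_fix_hours": 5},
    {"id": "INC-103", "critical_path": False, "users_affected": 50,   "est_fix_hours": 1},
]

# Work critical-path incidents first, then by user impact, then by quickest fix.
queue = sorted(
    incidents,
    key=lambda i: (not i["critical_path"], -i["users_affected"], i["est_fix_hours"]),
)

print([i["id"] for i in queue])  # ['INC-102', 'INC-101', 'INC-103']
```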

Have an Incident Management Plan in Place

One of the primary tools of any on-call team is their incident management plan. This plan consists of mandatory protocols in reconnaissance, diagnosis, communication, and other activities associated with an incident-resolution process. It is important to pay attention to all aspects of this plan and the incident resolution process, on the whole, to be able to respond to and resolve problems faster.

Maintain a Reliable Incident Management Action Plan

The first and foremost responsibility of the incident resolution team, as mentioned above, is to keep the incident management action plan up to date with the current system specifications. This plan dictates exactly what has to be done if anything in the system goes wrong, which makes it crucial to the incident resolution process.

There are multiple types of action plans that organizations implement. Three of the most common are the Ad hoc, Rigid, and Fluid approaches:

As the name suggests, the Ad hoc approach relies on figuring out and bringing together a plan of action right after a problem occurs. This is not one of the most recommended ways to work; however, the Ad hoc approach seems to fit well in those situations where the organization is unwilling to invest resources in a dedicated on-call team.

The Rigid approach is the traditional ITSM (IT Service Management) approach, often used by large organizations that maintain dedicated IT teams to respond to and resolve any issues that arise unexpectedly. Such a system makes the process simpler but carries a considerable cost.

The Fluid approach brings the best of Ad hoc and Rigid together, and most small to medium-large organizations use it. The on-call team is smaller in number and restricted in terms of knowledge but is highly skilled in communication and collaboration. The right resources and people are brought in just in time to aid the on-call team, which reduces the MTTR value and the cost involved.

Define Distinct Roles in Your Incident Response Hierarchy

Along with a proper action plan, you also need a well-defined team to execute the plan in time. The exact structure of your team will depend on the type of action plan that you choose to implement - Ad hoc, Rigid or Fluid.

In most cases, assigning a lead for each incident is the best way to begin, as that person can then research and communicate with full attention on the issue at hand. Assigning direct responsibility often helps resolve issues faster, because people usually hesitate to take the lead voluntarily. You can go further and assign technical and communications leads to support the incident lead; this helps with pulling in the right technicians and keeps stakeholders well informed of the situation.

The key takeaway is that you need to design the team hierarchy alongside your action plan to ensure things go smoothly during an incident.

Train the Incident Response Team Adequately

Now that you have a team and a plan, the next step is to make sure the two work well together. The team needs to be well-versed in the jobs they have to do, and the plan must be well tested and rehearsed to ensure that no step falls out of place or proves ineffective during an actual incident.

If your organization follows the Fluid model of incident response, there is even more ground to cover in training. People who are not part of your primary on-call team but are pulled in to fix issues during an incident need proper training to solve those issues as efficiently as possible. You can designate some personnel as specialists in certain incident categories and train them intensively on those technologies.

Along with training people appropriately, it is also important to manage the human resources of your incident response team well. While the incident lead must be well aware of the technologies involved in a response event, he or she must not be the only person to do so. Having redundant resources (i.e., more than one person who is skilled in a particular system) ensures no chaos when the only engineer goes on a vacation or leaves the organization. This helps you build a dynamic and reliable team that can cover performance-related incidents throughout the year.

Practice Incident Response via Chaos Engineering

A popular method of testing an incident response team without a real application war-room is to emulate one. Chaos Engineering is a methodology in which problems are intentionally and randomly injected into a system. This checks the system’s robustness and tests the incident response team’s ability to handle failures in as little time as possible.
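As a rough sketch of the idea, here is a toy fault-injection decorator in Python. It only illustrates randomly injected failures; dedicated chaos-engineering tools run such experiments in a far more controlled way:

```python
import random

def chaos(failure_rate=0.1, exception=ConnectionError):
    """Wrap a function so it randomly fails, simulating an unexpected outage."""
    def wrap(func):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise exception(f"chaos: injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return inner
    return wrap

@chaos(failure_rate=0.2)
def fetch_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

# Run this in a staging environment and measure how quickly monitoring,
# alerting, and the on-call team react to the injected failures.
for i in range(10):
    try:
        fetch_order(i)
    except ConnectionError as err:
        print(err)
```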

Chaos engineering brings several benefits. First, you learn how your system behaves when subjected to rough conditions and failures. Next, your incident response team gets a chance to identify its strengths and weaknesses. By removing the hurdles from your incident management process in a mock drill, you can quickly reduce your MTTR number. And if you can identify areas of improvement in your systems during a drill, you can stop those issues from ever happening in a real-world scenario, eliminating the need for an incident response altogether.

Study Incidents Later to Understand Their Cause Better

A crucial step in the standard issue-resolution process is understanding why an issue occurred, and surprisingly, it is also the most overlooked one. MTTR is incomplete without a strong incident follow-up process, in which the team investigates the cause of the incident in detail to learn how to prevent similar incidents from happening again. While this may happen after the system returns to normalcy, the time spent on this step is still part of the MTTR metric, so teams need to be quick yet thorough in their post-incident research.

Apart from stopping incidents from happening again, this introspection also helps guide further development of the system. If a new update is known to be the cause of an issue, it is rolled back and scrutinized to understand where things went wrong. This sometimes results in features being redesigned to be more efficient and reliable. Post-incident research is therefore not something you want your incident response team to skip.

Minimize Ad Hoc Efforts

Ad hoc efforts are made by an incident response team that is put together on the fly when an incident occurs. This is a highly economical approach for smaller organizations, as it frees them from the burden of maintaining a dedicated incident response team. It also works in scenarios where incidents are infrequent and a high MTTR does not hurt the business, because the Mean Time Between Failures (MTBF) is high.

However, in most medium to large-sized organizations, MTBF is usually low due to the large size of enterprise applications. This means you need to maintain a top-notch MTTR number to keep your business and customers on track. Ad hoc is not known for this.

If you were to implement the Ad hoc approach in a large-scale organization, it would take a considerable amount of time to find the right person to fix an issue on the spot, given the company’s size. And with so many people involved in the incident resolution process, properly training each of them is difficult. This simply means you are going to have a tough time bringing down your MTTR number.

When MTTR Isn’t A Useful Metric

There are several occasions where MTTR works against its purpose. It is important to understand that MTTR is just one of a family of failure metrics, alongside MTTF, MTBF, and others, and it provides the most value when used in conjunction with them. Read in isolation, a single MTTR figure can be skewed by incident frequency, traffic during the outage, or the number of concurrent issues, and relying on it blindly can do real damage to your business.

Lowering MTTR Is Worth The Effort

MTTR looks like a simple metric, but it holds much more than meets the eye. Done right, lowering MTTR can save your business a lot of revenue; mishandled, incidents will repeat themselves time and again, considerably reducing your application’s uptime. Therefore, it is important to understand the metric well and take concrete measures to reduce it as much as possible.