Optimize Trace Memory with Scout

Application performance monitoring (APM) is the process of monitoring and analyzing performance issues within an application. In a monolithic architecture, monitoring application performance with APM tools was straightforward. However, once an application adopts a microservice architecture, it becomes more complex, and business functionality flows through different microservices to complete a workflow. If something goes wrong in a transaction, it becomes difficult to diagnose where it went wrong. Distributed tracing solves this problem in distributed systems. Let's discuss tracing in APM vs. distributed tracing using frameworks like OpenTelemetry in detail.

What is Tracing in APM vs. OpenTelemetry?

Tracing is the process of tracking transactions within an application. It provides detailed insight into a single transaction as it moves through the application.

In the old days, most applications followed a monolithic architecture, and the application performance monitoring process worked well with that architecture. But as the industry evolved, companies started to adopt microservice architectures. As a result, it has become difficult for APM tools to trace transactions between microservices.

This is because tracing in a microservice architecture differs from tracing transactions in monolithic applications. That is where distributed tracing comes into play. Let's discuss APM tracing and distributed tracing using OpenTelemetry in detail.

APM tracing

APM tracing involves tracking transactions within an application based on events. A transaction is a series of events inside an application. For example, in an e-commerce application, a user adding a product to the cart is one event, and the user checking out and paying for the item is another. In a monolithic architecture, a single backend server handled all this logic, so tracing the transaction was straightforward as well.

Application tracing is simple if the application follows a monolithic architectural pattern. But tracing transactions became more complex as applications and the industry evolved toward microservices.

With such complexity in distributed applications, there are many variables to analyze when troubleshooting a performance bottleneck. So it becomes necessary to evolve monitoring from APM tracing to distributed tracing to achieve full-stack observability within the application.

OpenTelemetry Tracing

Before we get into the details of OpenTelemetry and distributed tracing, let's discuss telemetry. Telemetry is the process of collecting and transmitting information about a system. We will discuss telemetry data in a later section.

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project started in 2019. It is an open-source observability framework for collecting telemetry data from distributed systems to troubleshoot and analyze application performance issues.

Given how valuable distributed tracing is, you might wonder why companies adopt an open-source tool into their ecosystem instead of building their own. Creating distributed tracing instrumentation from scratch that works seamlessly across different frameworks is hard and expensive. That's why most APM tools have adopted OpenTelemetry into their systems.
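
To make that concrete, here is a minimal sketch of what OpenTelemetry tracing instrumentation can look like in application code, using the Python SDK with a console exporter; the service and span names are hypothetical.

```python
# A minimal sketch of OpenTelemetry tracing in Python; names are illustrative.
# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each unit of work becomes a span; spans started inside it become its children.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.items", 3)
    with tracer.start_as_current_span("charge-card"):
        pass  # the payment call would go here
```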

Why is Tracing Important?

Let’s discuss the importance of tracing with an example. A few weeks ago, our team faced a problem where a user couldn’t complete a payment in the application. Since it was a critical issue, our team started analyzing it as a priority.

First, we analyzed the payment service to check whether it was healthy and working as it should. Once our team confirmed that the payment service was healthy, we started analyzing the user service to see if the issue was specific to the user’s information in the database. Again, everything was good, and there were no data anomalies. Upon investigating the issue in depth, we found that it was coming from the authentication service, which had failed to provide some critical information needed by the payment service. We then fixed the problem and closed the issue. Even though the problem was small, the effort and time it took to debug and fix were high.

Imagine hundreds of microservices connected to complete a single transaction and facing such problems. It would be hard for developers even to diagnose the issues. That’s where tracing comes into the picture and saves debugging time.

Tracing is an essential pillar of observability. Let’s discuss observability and its other pillars in application monitoring.

A Building Block to Full Stack Observability

Observability provides deep insight into a distributed system, helping teams and engineers troubleshoot application and performance issues. Traces are an essential pillar of observability. They provide context across the distributed system for each transaction between services, help the team analyze how quickly each microservice performs, and help them resolve issues quickly.

More than that, tracing helps to answer some critical questions while troubleshooting: 

  1. Which service is causing the issue?
  2. What are the health and performance of each service?
  3. Which service could potentially cause a performance bottleneck?
  4. Why is a specific service broken?

Observability collects and aggregates all the telemetry data in the distributed system so it can be analyzed to troubleshoot performance bottlenecks. It helps managers and engineers maintain and improve overall performance and make better architectural decisions for the application.

Observability is about instrumenting the system to gather data on its health and performance in order to understand why it behaves the way it does. When a system develops a fault, telemetry data makes that analysis efficient.

Let’s discuss the types of telemetry data, commonly referred to as the pillars of observability:

  1. Logs
  2. Metrics
  3. Tracing
  4. Events

Logs

Logs are structured or unstructured text emitted by applications. Each service in a distributed system produces these textual records. Logs help analyze the system and uncover anomalies and errors.

Logs are easy to generate. Most of the logs a specific service generates describe actions and events within that service. Since logs are often raw and unstructured, it’s difficult to trace a transaction-related issue with them alone. Logs are most useful when troubleshooting cache and database issues specific to a service.

Most framework libraries come with logging support, but each framework formats its logs differently. The result is unstructured output that is difficult to consume when analyzing a distributed system.
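
As a small illustration, assuming Python’s standard logging module, the difference between an unstructured log line and a structured one looks like this; the service name and field names are made up for the example.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payment-service")  # hypothetical service name

# Unstructured: easy to write, but hard to query across many services.
logger.info("payment failed for order 4711 after 3 retries")

# Structured: the same information as machine-readable fields.
logger.info(json.dumps({"event": "payment_failed", "order_id": 4711, "retries": 3}))
```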

Metrics

Metrics are numerical representations of data used to analyze a component's behavior over time. Unlike logs, metrics measure the performance of the system running the service. For example, a metric can be the amount of memory a system consumes over time.

Metrics represent the health of a specific service by providing information such as the number of HTTP requests it handles over a specific period. A few standard kinds of metrics, often called “Golden Signals,” help monitor a service's overall health (the sketch after this list shows how two of them can be recorded):

  1. Availability: the percentage of requests the service handles successfully out of total requests.
  2. Errors: the rate of failed requests in the specific service.
  3. Request rate: the rate of incoming requests to the service.
  4. Utilization: how much of the host machine’s resources, such as CPU and memory, the service is using.
  5. Latency: the time the service takes to respond to requests, usually tracked over a specific period.
  6. Saturation: how loaded the specific service is.
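
As a minimal sketch, assuming the OpenTelemetry metrics API for Python, request rate and latency could be recorded like this; the metric names, attribute names, and service name are illustrative assumptions rather than required conventions.

```python
# A minimal sketch of recording request-rate and latency metrics with the
# OpenTelemetry Python SDK. Metric, attribute, and service names are illustrative.
# Requires: pip install opentelemetry-sdk
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Export collected metrics to the console periodically (swap in a real exporter later).
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("payment-service")  # hypothetical service name
request_counter = meter.create_counter(
    "http.server.requests", unit="1", description="Incoming HTTP requests")
latency_histogram = meter.create_histogram(
    "http.server.duration", unit="ms", description="Request handling time")

# Inside a request handler:
request_counter.add(1, {"route": "/checkout", "status_code": 200})
latency_histogram.record(42.7, {"route": "/checkout"})
```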

Traces

Traces represent a transaction in a distributed system. As we discussed, a transaction includes a series of events and actions that complete a workflow, so it can involve different services in a distributed system. Tracing helps you view and analyze the entire lifecycle of a request. It is an essential part of full-stack observability and helps the team understand what is happening in the whole system.

A trace represents the complete journey of a transaction. It allows the team to observe systems, especially containerized applications, serverless applications, and microservices. Combined with logs and metrics, traces make it easier to debug performance bottlenecks and fix issues quickly.
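
As a sketch of how a single trace follows a transaction across services, OpenTelemetry propagates trace context on outgoing requests so downstream spans join the same trace; the service names and URL below are hypothetical, and the SDK is assumed to be configured as in the earlier tracing sketch.

```python
# A sketch of trace context propagation between services (names are illustrative).
# Assumes the tracer provider has been configured as in the earlier tracing sketch.
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("order-service")  # hypothetical upstream service

with tracer.start_as_current_span("place-order"):
    headers = {}
    inject(headers)  # adds the W3C `traceparent` header for the current span
    # The downstream call carries the trace context, so the payment service's
    # spans become children of this same trace, e.g.:
    # requests.post("https://payments.internal/charge", headers=headers)
    print(headers)
```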

Events

Events are immutable records of state changes caused by actions over time. For example, a user adding an item to the cart is an action; the system updating the user’s cart with that item in the database is recorded as an event.

Events are structured logs. They follow a standardized format, such as JSON, that helps present the information in a structured way.
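
For example, the cart update above might be recorded as an immutable, JSON-serialized event like the following sketch; the field names are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

# A hypothetical immutable event record for the "item added to cart" example.
event = {
    "event_type": "cart.item_added",
    "user_id": "u-1001",
    "item_id": "sku-42",
    "occurred_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(event))  # appended to an event log and never mutated afterwards
```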

A Full Scope on All Servers

Collecting, aggregating, and analyzing data for every service to gain insight into the application can be cumbersome for developers and teams. OpenTelemetry solves that problem by integrating with each service and simplifying how the data is collected and aggregated.

OpenTelemetry streamlines collecting, processing, and transmitting telemetry data to monitoring and observability providers like Scout, which help you gain insights and improve the value you can derive from this newly unified data.
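
As a sketch, shipping the same telemetry to a backend is usually a matter of swapping the console exporter for an OTLP exporter pointed at a collector or vendor endpoint; the endpoint below is a local placeholder, not an actual ingestion address.

```python
# A sketch of exporting spans over OTLP to an observability backend.
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
# The endpoint below is a placeholder (the default local collector address).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
# Spans created via trace.get_tracer(...) are now batched and sent to the collector.
```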

Improve Your Observability with ScoutAPM

So far, we have seen what tracing is in APM and how it differs from distributed tracing using OpenTelemetry. We have also discussed full-stack observability and its pillars.

Monitoring is the first step toward building a resilient application for your business. But once the application grows and becomes a distributed system, observability can help you understand your application in depth and gain complete insight into what is happening in the system.

OpenTelemetry helps you achieve full-stack observability by collecting, aggregating, and transferring data to the target system. We at Scout are building a next-generation observability platform based on OpenTelemetry.