Coming Soon: your Rails app performance trends & outliers, via email
I follow a simple rule before configuring a monitoring alert: if I receive this alert at 3am, will I act on it?
If not, it shouldn't be an alert.
Few performance-related alerts meet this criterion. For example, if our app is running 25% slower, it's not worth a hasty 3am fix, but it is worth a first-thing-in-the-morning effort.
That's the drive behind a feature we'll make available soon: the Digest Email. Available in daily or weekly editions, the Digest Email summarizes your app's performance and directs you to bottlenecks.
How It Works
At a frequency of your choice (daily or weekly), we'll crunch the numbers on your app's performance (both web endpoints and background jobs). Performance is compared to the previous week, and highlights are mentioned in the email.
To start, there are three specific areas we're focusing on.
1. Trends
It's easy to just grab endpoints with large changes in their mean response time between today and last week. However, that adds significant noise: a rarely used endpoint, like UsersController#forgot_password, may vary widely in response time. Is it worth the development effort if response times are bouncing between 100 ms and 500 ms? Frequently, the answer is no.
Scout works hard to identify significant trends. Some of the approaches our algorithms apply:
- Endpoints must meet a minimum threshold of time consumed (throughput * mean response time) to be considered. This filters out low-volume endpoints.
- There must be a minimum number of requests in each time step of the period for the trend to qualify.
- The trend must extend across a significant period of time: we don't consider a large spike in response time over a 5-minute period a trend.
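The three filters above can be sketched in a few lines of Ruby. This is only an illustration of the idea, not Scout's actual algorithm: the Bucket struct, the method names, and every threshold value are assumptions made up for the example.

```ruby
# All thresholds below are hypothetical, chosen only for illustration.
MIN_TIME_CONSUMED_MS = 600_000    # total request time across the window
MIN_REQUESTS_PER_BUCKET = 30      # minimum sample size in every time step
MIN_CHANGE_RATIO = 1.25           # "slower" means at least 25% above baseline
MIN_SUSTAINED_BUCKETS = 3         # consecutive slow steps needed for a trend

# One time step (say, an hour) of traffic for a single endpoint.
Bucket = Struct.new(:requests, :mean_ms)

# Time consumed = throughput * mean response time, summed over the window.
def time_consumed_ms(buckets)
  buckets.sum { |b| b.requests * b.mean_ms }
end

# Length of the longest run of consecutive buckets slower than the baseline.
def longest_slow_run(buckets, baseline_mean_ms)
  longest = run = 0
  buckets.each do |b|
    slow = b.mean_ms >= baseline_mean_ms * MIN_CHANGE_RATIO
    run = slow ? run + 1 : 0
    longest = [longest, run].max
  end
  longest
end

# Applies the three filters in order: volume, sample size, persistence.
def significant_trend?(buckets, baseline_mean_ms)
  return false if time_consumed_ms(buckets) < MIN_TIME_CONSUMED_MS
  return false if buckets.any? { |b| b.requests < MIN_REQUESTS_PER_BUCKET }
  longest_slow_run(buckets, baseline_mean_ms) >= MIN_SUSTAINED_BUCKETS
end
```

With these made-up numbers, a busy endpoint that stays 50% slower for six steps qualifies, while a rarely used endpoint or a single-step spike does not.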
To make tracking down the source of trends easier:
- If you've enabled deploy tracking, we'll attempt to identify if a deploy triggered the trend.
- If you've enabled deploy tracking and the GitHub integration, we'll mention the developers with commits associated with the deploy that triggered the change.
2. Slow Outliers
What if an endpoint is fine for 90% of users but becomes extremely slow for a small subset? The small percentage of users experiencing performance problems are frequently high-paying power users who are pushing your app the hardest. For example, a controller action that renders all employees at a startup will load quickly, while that same endpoint would fall over if the company were Apple.
Additionally, these very slow outliers can trigger frustrating capacity problems and, in a worst-case scenario, momentary downtime. It's far more difficult to determine the capacity you need to serve your app when response times vary widely (Little's Law relates only averages, so an estimate built on the mean response time says little about peak load when the distribution is wide).
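To put rough numbers on this, Little's Law says the average number of requests in flight equals the arrival rate multiplied by the mean time each request spends in the system. The sketch below applies it separately to typical and outlier requests. The traffic mix (50 requests per second, 1% of them taking 20 seconds) is entirely hypothetical.

```ruby
# Little's Law: average requests in flight = arrival rate * mean time in system.
def in_flight(requests_per_second, mean_seconds)
  requests_per_second * mean_seconds
end

# Hypothetical traffic mix, for illustration only.
RPS = 50.0
REQUEST_CLASSES = {
  typical: { share: 0.99, seconds: 0.1 },
  outlier: { share: 0.01, seconds: 20.0 }
}.freeze

# Average in-flight requests contributed by each class of request.
IN_FLIGHT_BY_CLASS = REQUEST_CLASSES.transform_values do |c|
  in_flight(RPS * c[:share], c[:seconds])
end

TOTAL_IN_FLIGHT = IN_FLIGHT_BY_CLASS.values.sum
OUTLIER_SHARE = IN_FLIGHT_BY_CLASS[:outlier] / TOTAL_IN_FLIGHT
```

Under these assumptions the 1% of slow requests account for about two-thirds of the requests in flight at any moment, so a small shift in how often the outliers occur moves the required capacity far more than the mean response time suggests.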
We highlight endpoints that are triggering these slow outliers, but that's not all. We also identify any significant bottlenecks (example: a slow ActiveRecord query).
Bonus: if you've set up our GitHub integration, you'll see who last touched any expensive code paths.
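One simple way to spot an endpoint with slow outliers is to compare a high percentile of its response times against the median. The sketch below is an assumption about how such a check could look, not a description of Scout's detection: the nearest-rank percentile, the choice of the 99th percentile, and the 10x ratio are all illustrative.

```ruby
# Hypothetical cutoff: the 99th percentile is at least 10x the median.
OUTLIER_RATIO = 10.0

# Nearest-rank percentile of a non-empty array of numbers.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank - 1, 0].max]
end

# True when the slowest requests are far slower than the typical request.
def slow_outliers?(response_times_ms)
  return false if response_times_ms.empty?
  p99 = percentile(response_times_ms, 99)
  median = percentile(response_times_ms, 50)
  p99 >= OUTLIER_RATIO * median
end
```

A mean-based check would blur the two populations together, which is why the comparison here uses percentiles instead.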
3. The email subject
Our subject line is dynamic, changing with your aggregate app performance. Here's an example:
If performance isn't changing, it's important to know that too:
Also, we display a friendly emoticon when things are going well:
It's a nice, friendly reward.
The goal: if things haven't changed, there's no need to open the email. If we think there's something worth investigating, we'll draw your attention.
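As a rough sketch of the idea, a subject line like this can be derived from the percent change in mean response time between the current and previous period. The method names, the 5% "steady" band, and the subject wording below are all invented for illustration and are not the actual text of the Digest Email.

```ruby
# Percent change in mean response time, rounded to a whole number.
def percent_change(current_ms, previous_ms)
  raise ArgumentError, "previous_ms must be positive" unless previous_ms.positive?
  ((current_ms - previous_ms) / previous_ms.to_f * 100).round
end

# Builds a hypothetical subject line; steady_within is the "no change" band in percent.
def digest_subject(app_name, current_ms, previous_ms, steady_within: 5)
  change = percent_change(current_ms, previous_ms)
  if change.abs <= steady_within
    "#{app_name}: performance is steady"
  elsif change.negative?
    "#{app_name}: #{change.abs}% faster :)"
  else
    "#{app_name}: #{change}% slower"
  end
end
```

The steady band keeps small week-to-week wobble from reading as news, which matches the goal of only drawing attention when something has actually changed.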
We're limiting the number of recipients as we tune our algorithms based on your feedback. Enable the Digest Email in your user settings to ensure you'll be in our first access group.
Most app performance issues don't warrant immediate, one-off alerts, but they do warrant a holistic per-day or per-week review.
The Scout Digest Email aims to address this while identifying the source of issues.