The 3 pillars of our Rails Monitoring Stack

Life as a cross-dresser must be unnerving. I could buy a wig, put some makeup on my Adam’s apple. Buy a skirt. Does a sock-filled bra work for my bosom?

With a bit of mood lighting, I might actually look like a woman, but it would be difficult to relax. Knowing my manhood could be exposed in a multitude of ways (some stubble on my chin, hair on my legs, forgetting to make my voice less deep, etc.) would leave me on edge.

In some ways, I feel like I identified with cross-dressers before we solidified our Rails monitoring stack over the past year. I just felt vulnerable – that one wrong move could send our Rails stack tumbling.

We’ve standardized our setup, and I thought I’d share the 3 tools we’re using that make me feel much less like a cross-dresser these days.

The 3 pillars of our Rails monitoring stack

We break Rails monitoring into the 3 parts below (along with the tools we use):

Process Monitoring
  Nickname: “The Diaper”, a safeguard for emergencies
  Tool: Monit
  What it does: Ensures Mongrels are running and restarts leaking processes

System Performance
  Nickname: “The Nerves”, preventing future problems
  Tool: Scout
  What it does: Catches disturbing trends before they become problems (disk space usage, server load, slow requests, etc.)

Exception Notifications
  Nickname: “The Megaphone”, a loud voice when your app is breaking
  Tool: Exception Notification plugin
  What it does: Organizes and collects application exceptions

So, to be clear, even though we built Scout, we use a combination of tools to make sure our Rails apps are running.

The diaper – Process monitoring with Monit

If a key process, like a Mongrel server containing a Rails application, dies or leaks memory, you often want to restart it immediately. Frankly, it’s panic time: it could happen at any time, and you may not have time to figure out what went wrong. You just want the application back up. We’ve been using the open-source tool Monit for this task for some time (God, a Ruby-powered process and task monitoring tool, is another option).

Configure a restart script

I’m not going to cover the Monit installation process (see Monit’s documentation for that), but I’ll show a simple example for restarting a Mongrel process.

Here’s how we restart a Mongrel process that stays above 100 MB of memory for 5 cycles:

# Watch the Mongrel on port 8000 through its pidfile
check process mongrel-8000 with pidfile /var/run/mongrel_cluster/
  group mongrel_staging
  # Start and stop only this Mongrel, running as the deploy user
  start program = "/usr/bin/mongrel_rails cluster::start -C /etc/mongrel_cluster/app_name.yml --only 8000 --clean"
    as uid deploy and gid deploy
  stop program  = "/usr/bin/mongrel_rails cluster::stop -C /etc/mongrel_cluster/app_name.yml --only 8000 --force --clean"
    as uid deploy and gid deploy
  # Restart when total memory (process plus children) exceeds 100 MB for 5 checks in a row
  if totalmem > 100.0 MB for 5 cycles then restart

(To automate the setup of these configurations based on your Mongrel configuration, check out the capistrano_monit extension.)
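If the "for 5 cycles" part of the last rule is unclear, here is a rough Ruby model of it. This is not Monit's code, just a sketch of how I read the rule: the memory limit has to be exceeded on five consecutive checks before a restart fires, so a single spike is ignored. The class name and the sample numbers are made up for illustration.

```ruby
# Hypothetical model of "if totalmem > 100.0 MB for 5 cycles then restart".
# Not Monit's implementation; it only illustrates the consecutive-cycle idea.
class MemoryRule
  def initialize(limit_mb: 100.0, cycles: 5)
    @limit_mb = limit_mb
    @cycles = cycles
    @streak = 0
  end

  # Feed one memory reading (in MB) per polling cycle.
  # Returns true when the rule would trigger a restart.
  def observe(total_mem_mb)
    # A reading at or below the limit resets the streak.
    @streak = total_mem_mb > @limit_mb ? @streak + 1 : 0
    return false if @streak < @cycles

    @streak = 0 # start counting afresh after a restart
    true
  end
end

rule = MemoryRule.new
# Made-up readings: a short spike, a dip, then five readings over the limit.
samples = [90, 120, 130, 95, 110, 115, 120, 125, 130]
restarts = samples.map { |mb| rule.observe(mb) }
```

Only the final reading triggers a restart here, because the dip to 95 MB resets the count.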

It’s easy to test your Monit setup once it is in place:

  1. Tail your system message log (e.g. tail -f /var/log/messages)
  2. Manually kill the process
  3. Watch for Monit to restart the process in the messages log. It looks like this:

Aug 22 13:57:13 test monit[32613]: 'mongrel_staging-9000' process is not running 
Aug 22 13:57:13 test monit[32613]: 'mongrel_staging-9000' trying to restart 
Aug 22 13:57:13 test monit[32613]: 'mongrel_staging-9000' start: /usr/bin/mongrel_rails 
Aug 22 13:58:19 test monit[32613]: 'mongrel_staging-9000' process is running with pid 678
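If you would rather not eyeball the log, the check in step 3 can be scripted. The following is a minimal sketch in plain Ruby that assumes the log lines look exactly like the sample above; the regular expression and the restart_confirmed? helper are my own invention, not part of Monit.

```ruby
# Sample lines copied from the log output above.
LOG_LINES = [
  "Aug 22 13:57:13 test monit[32613]: 'mongrel_staging-9000' process is not running",
  "Aug 22 13:57:13 test monit[32613]: 'mongrel_staging-9000' trying to restart",
  "Aug 22 13:57:13 test monit[32613]: 'mongrel_staging-9000' start: /usr/bin/mongrel_rails",
  "Aug 22 13:58:19 test monit[32613]: 'mongrel_staging-9000' process is running with pid 678"
].freeze

# Captures the quoted service name and the message that follows it.
MONIT_LINE = /monit\[\d+\]: '([^']+)' (.+?)\s*\z/

# True when the service was reported down and later reported running again.
def restart_confirmed?(lines, service)
  saw_failure = false
  lines.each do |line|
    match = MONIT_LINE.match(line)
    next unless match && match[1] == service

    message = match[2]
    saw_failure = true if message == "process is not running"
    return true if saw_failure && message.start_with?("process is running with pid")
  end
  false
end

confirmed = restart_confirmed?(LOG_LINES, "mongrel_staging-9000")
```

To run it against a real log, pass the lines of your messages file (for example from File.readlines) instead of LOG_LINES.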

It’s just a diaper – it won’t solve underlying problems

Monit will ensure we’re not totally dead, but it won’t prevent problems from happening in the future. It can restart a memory-leaking process, but it won’t give us any clues about the leak. It’s not preventive medicine, and preventive medicine is what we need.

The nerves – system performance with Scout

When I was a kid, I once wished that I couldn’t feel pain. I was awkward and fell down a lot. As I got older, I realized some pain is good. It’s our body’s way of saying “Hey, slow down a second, you need to check this out”.

Scout is our nerve center. We use it to monitor trends in our Rails stack. Has disk space usage quickly increased? Has there been a spike in the server load? How many more users can our current hardware handle?

For example, a while back we noticed that disk space usage was increasing at an alarming rate on one of our servers. We caught it early on by looking at the dramatic increase in disk space usage from the graph below:

The problem? An issue with an Amazon S3 backup. We fixed the problem long before it became an issue, and as you can see from the graph, things returned to normal.
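The idea behind catching a trend like this can be sketched in a few lines of plain Ruby. To be clear, this is not Scout's API or plugin code, and the numbers are invented; it only shows why a growth rate tells you more than a single reading. The helper names and the straight-line projection are assumptions of mine.

```ruby
# Average change per day across a series of readings (percent of disk used).
def daily_growth(samples)
  return 0.0 if samples.size < 2

  (samples.last - samples.first).to_f / (samples.size - 1)
end

# Rough straight-line projection of how many days remain before the disk fills.
# Returns nil when usage is flat or shrinking.
def days_until_full(samples, capacity: 100.0)
  growth = daily_growth(samples)
  return nil unless growth.positive?

  ((capacity - samples.last) / growth).floor
end

# Hypothetical readings, one per day.
steady  = [40.0, 40.5, 41.0, 41.5]
spiking = [40.0, 46.0, 52.0, 58.0]

steady_days  = days_until_full(steady)
spiking_days = days_until_full(spiking)
```

Both servers are far from full today, but the second one is on course to fill up within days, which is the kind of signal a graph makes obvious.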

In addition to disk usage, we use Scout to monitor our Rails requests, server load, memory usage, and more.

The megaphone – Exception notifications with the Exception Notification plugin

We’ve got most of our vitals covered: Monit will automatically restart our Rails app when it dies, and Scout lets us identify system resource trends quickly. However, things often go wrong even when our application server and Rails stack are running fine.

The tried-and-true method for instant exception notifications is the Exception Notification plugin. If you’re part of a development team, you might want to look at Hoptoad. Hoptoad provides organization to the typical flood of emails the Exception Notification plugin can generate. You install the Hoptoad plugin, which contacts the Hoptoad server when an exception occurs. Hoptoad eliminates duplicate emails and makes it easier to organize exceptions across applications.

Update: Hoptoad has been re-branded to Airbrake Bug Tracker.
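To show why eliminating duplicates matters, here is a small sketch of the general idea in plain Ruby. It is not how Hoptoad or the Exception Notification plugin is implemented; the DedupingNotifier class is hypothetical, and the deliveries array stands in for actually sending an email. Exceptions with the same class and message share a fingerprint, and only the first one produces a notification.

```ruby
require "digest"

# Hypothetical in-memory notifier, for illustration only.
class DedupingNotifier
  attr_reader :deliveries, :counts

  def initialize
    @deliveries = []        # stands in for emails that would be sent
    @counts = Hash.new(0)   # occurrences per fingerprint
  end

  # Records the exception and "sends" a notification only the first time
  # a given fingerprint is seen. Returns the fingerprint.
  def notify(exception)
    key = fingerprint(exception)
    @counts[key] += 1
    @deliveries << "#{exception.class}: #{exception.message}" if @counts[key] == 1
    key
  end

  private

  # Same class and message means same fingerprint.
  def fingerprint(exception)
    Digest::SHA1.hexdigest("#{exception.class}|#{exception.message}")
  end
end

notifier = DedupingNotifier.new
3.times { notifier.notify(ArgumentError.new("bad id")) }
notifier.notify(RuntimeError.new("timeout"))
```

Four exceptions come in, but only two notifications go out, and the counts still tell you how often each one happened.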

It’s about peace of mind

From start to finish, you can have Monit set up and tested in a couple of hours (at most). The Scout and Exception Notification plugin/Hoptoad installation process should be measured in minutes. It’s time well spent knowing things are in order when you aren’t in front of your computer.