Kestrel in Production at Papertrail

We’ve added prolific plugin contributor Eric Lindvall’s latest plugins to Scout: Kestrel Overall and Kestrel Queue Monitoring. Kestrel is a simple message queue built from production needs at Twitter. Being the gentleman he is, Eric shared his experiences with Kestrel at Papertrail, a hosted log aggregation service.

Why Kestrel?

We picked Kestrel because it has bounded memory usage (a configuration setting specifies how many queued messages should reside in RAM, generally defaulting to 128MB), it's small (I was able to read the entire Scala codebase in a weekend), and running on the JVM, which we have experience with, was a plus.

What other tools did you consider?

We seriously considered Kafka, and from what I've heard from people who have used it, I'm sure we would have been happy with it as well.

How is the setup/maintenance/learning curve compared to other tools?

It's very simple to set up (mostly just downloading it and creating a configuration file) and it just runs.

Key metrics to monitor?

Monitoring the depths of all of your queues is important for understanding the health of your application. Under normal circumstances, queues should almost always be empty. Setting up alert triggers on queue depth or queue age (how old the oldest message is) is a great way to get notified as soon as a component in your system starts to have problems.
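Kestrel reports these metrics through the memcached `STATS` command. As a rough sketch, the per-queue stat names below (`queue_<name>_items` for depth, `queue_<name>_age` for the age of the oldest message) follow Kestrel's naming convention, but you should verify them against your Kestrel version; the sample output is made up for illustration.

```ruby
# Parses raw "STAT <key> <value>" lines from Kestrel's memcached-style STATS
# output into { queue_name => { items: depth, age: ms } }.
def parse_kestrel_stats(raw)
  stats = Hash.new { |h, k| h[k] = {} }
  raw.each_line do |line|
    next unless line =~ /\ASTAT queue_(.+)_(items|age) (\d+)/
    stats[$1][$2.to_sym] = $3.to_i
  end
  stats
end

# Hypothetical STATS response for two queues, "jobs" and "mail".
sample = <<~STATS
  STAT queue_jobs_items 12
  STAT queue_jobs_age 450
  STAT queue_mail_items 0
  STAT queue_mail_age 0
  END
STATS

depths = parse_kestrel_stats(sample)
puts depths["jobs"][:items] # depth of the "jobs" queue
puts depths["jobs"][:age]   # age of its oldest message
```

An alerting script could run this against each Kestrel host on an interval and trigger when depth or age crosses a threshold.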

What do you use it for?

Papertrail receives log messages from our customers' applications and servers, which means we are subject to their traffic patterns. One customer might go from 20 log messages per second to 2,000 per second in the span of a second, which is much faster than any closed-loop autoscaling could provision new capacity.

Our job is to apply a relatively heavyweight process to those messages: persist them, make them searchable, update any active Web and command-line “tail” clients, and potentially, invoke alerts that now have new matches. Our first responsibility is to ensure that one customer’s log volume does not impact another customer, and our second responsibility is to process each customer’s messages as quickly as practical.

Our end-to-end target, from a customer's server through Papertrail to a viewer, is 2 seconds, which we hit more than 99.5% of the time in any given month. 700 milliseconds is more typical, and that includes traversing the Internet twice.

Between the log volume, the variability in log volume, and the need to service the messages extremely quickly (with what could be a heavyweight process), we care about efficient queuing. That’s where a message queue like Kestrel or Kafka comes in.

Coming from a Ruby world, why Kestrel vs. Resque?

There are a lot of things that could be done with either Kestrel or Resque. Because Resque is backed by Redis, all of the messages waiting to be processed have to fit in the RAM of the Redis server. With Kestrel, you can queue millions or even billions of messages and then start to pull them off.

The biggest difference between the two is how each impacts the architecture of your system.

With Resque, the message includes the name of the class that is supposed to do the work, which makes it more difficult (but not impossible) to drastically change how that work is performed.

With Kestrel, messages are pulled from a named queue, which can make it easier for a developer to think about problems in a way that gives them more flexibility (including fairly easily changing the language or platform the work is performed on).
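A toy illustration of that coupling difference (this is not Resque's or Kestrel's actual API; the class name, queue name, and fields are made up):

```ruby
require "json"

# Resque-style: the message itself names the Ruby class that will perform
# the work, so producers and consumers are coupled to that class name.
resque_payload = JSON.generate("class" => "LogIndexer", "args" => [42])

# Kestrel-style: the message is just data on a named queue; any consumer,
# written in any language, decides how to process what it pulls off.
kestrel_queue   = "log_events"
kestrel_payload = JSON.generate("customer_id" => 42)

puts resque_payload
puts "#{kestrel_queue}: #{kestrel_payload}"
```

Swapping the Kestrel consumer for a JVM or Go implementation changes nothing about the messages; swapping the Resque worker means the `class` field in every queued job now points at code that may no longer exist.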

Many of the benefits that you see with Kestrel often won’t be evident until a high volume of work is being performed by the system.

Let's say I have a Ruby application. How hairy is Kestrel if you don't have Java experience?

Kestrel speaks the memcached protocol, so it is very easy to use existing libraries to talk to it.

From Ruby, Twitter has a client that makes it even easier.
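To make the memcached point concrete, here is a sketch of the text-protocol frames a client sends: a `set` on a key enqueues a message onto the queue of that name, and a `get` dequeues the next one. The queue name `jobs` and payload are invented for illustration, and a real client would also handle the server's responses.

```ruby
# Builds a memcached `set` frame: enqueues `payload` onto queue `queue`.
# The "0 0" fields are the memcached flags and expiration.
def enqueue_frame(queue, payload)
  "set #{queue} 0 0 #{payload.bytesize}\r\n#{payload}\r\n"
end

# Builds a memcached `get` frame: dequeues the next message from `queue`.
def dequeue_frame(queue)
  "get #{queue}\r\n"
end

puts enqueue_frame("jobs", "hello").inspect  # => "set jobs 0 0 5\r\nhello\r\n"
puts dequeue_frame("jobs").inspect           # => "get jobs\r\n"
```

Because these are plain memcached commands, any existing memcached client library can produce them, which is why no Java (or Scala) knowledge is needed on the application side.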