CouchDB in production
John P. Wood of Signal, which offers a mobile customer engagement platform used by many top brands, recently created a couple of Scout plugins for monitoring CouchDB. I’ve always been impressed by the team at Signal, so I was curious how they were using CouchDB in production. It turns out CouchDB is a huge part of their infrastructure – for example, one of their CouchDB databases is over 130GB in size.
John was kind enough to share his experiences with CouchDB below.
You use a number of different storage engines (MySQL, CouchDB, MongoDB, and Memcached) at Signal. Where does CouchDB fit in?
A couple of years ago we were running into performance issues with some very large MySQL tables. Queries against these tables were taking a very long time to run, and were causing page timeouts in our web application. On the advice of a friend who was helping us out as a consultant, we started looking at CouchDB. CouchDB views turned out to be a great fit for our problem.
A key component of our application is SMS messaging. The problematic MySQL queries we were running were collecting aggregate stats on these messages (how many messages did account A send in January of 2009, in all of 2009, how many across all accounts, etc). Most of the queries were executing on past data, meaning the results of those queries would not change once that time period had passed. So, it was simply a waste to re-calculate these numbers over and over. We considered using summary tables in MySQL to avoid this costly re-calculation, but saw them as being inflexible and difficult to maintain.
CouchDB views use MapReduce to index your data. The map and reduce functions compute the results of your query, which can then be fetched via an HTTP request. As data is added to or removed from the database, those indexes are incrementally updated. In addition, CouchDB stores view results throughout the B-tree data structure it uses for the index. If CouchDB sees that your query will include all of the children of a given node, it will simply pull the “summary” result from the parent node, preventing it from having to visit each of the child nodes for their individual results. This, and the fact that view results are computed when the view is built, makes querying views very fast. CouchDB helped us dramatically reduce the amount of time it took to fetch this data.
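To make that concrete, here is a minimal sketch of what such a view might look like. The field names (`account_id`, `sent_at`) and the design document name are hypothetical, not Signal’s actual schema; the small Python function below it just illustrates how CouchDB’s built-in `_count` reduce combines values in two modes, which is what lets a parent B-tree node hold a ready-made summary for its whole subtree.

```python
# A hypothetical design document for counting messages per account and month.
# Field names (account_id, sent_at) are illustrative only.
design_doc = {
    "_id": "_design/messages",
    "views": {
        "by_account_month": {
            # The map emits a [account, year, month] key per message so
            # results can be grouped at any of those levels.
            "map": """
                function (doc) {
                  // sent_at assumed to be an ISO-8601 string like "2009-01-15T12:00:00Z"
                  emit([doc.account_id, doc.sent_at.slice(0, 4), doc.sent_at.slice(5, 7)], 1);
                }
            """,
            # CouchDB's built-in _count reduce; partial results are cached
            # at interior B-tree nodes.
            "reduce": "_count",
        }
    },
}

# What _count does, sketched in Python: the same function handles both raw
# map output (rereduce=False) and already-reduced results from child nodes
# (rereduce=True), so a parent node can answer for its subtree without
# visiting the leaves.
def count_reduce(keys, values, rereduce):
    return sum(values) if rereduce else len(values)

leaf_jan = count_reduce([["acct1", "2009", "01"]] * 3, [1, 1, 1], rereduce=False)
leaf_feb = count_reduce([["acct1", "2009", "02"]] * 2, [1, 1], rereduce=False)
total = count_reduce(None, [leaf_jan, leaf_feb], rereduce=True)
print(total)  # 5 — acct1's 2009 total, combined from the two partial counts
```

Querying this view with `group_level=2` would then return per-account, per-year counts directly from those cached partial results.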
So, we are primarily using CouchDB for large datasets when we know exactly how we plan on querying that data. We’re also using it for a few features that take advantage of CouchDB’s schema-less nature. We still use MySQL as our primary data store. We also use Memcached for web application caching, and MongoDB for geospatial queries.
When monitoring CouchDB, are there key stats you watch closely?
We keep an eye on the database read and write throughput, to get an idea of how hard we are hitting CouchDB. We also use the Scout CouchDB Database and CouchDB Lucene monitoring plugins to keep an eye on the size of our databases and Lucene indexes. Our messaging database, which holds information about every SMS message that we have ever sent, is a whopping 130GB. The Lucene index alone for this database is over 13GB. Being able to see the sizes of these data stores, and their growth over time, helps us plan for the future.
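For anyone wanting to track database size themselves: CouchDB reports it in the JSON returned by a GET on the database URL. Below is a small sketch of extracting that number; the sample response is fabricated (chosen to match the ~130GB figure mentioned above), and a real monitor would fetch it over HTTP rather than hardcode it.

```python
import json

# Fabricated sample of what `GET /messages` returns from CouchDB's
# database-info endpoint. Real responses include doc_count, disk_size
# (in bytes), update_seq, and other fields.
sample = json.loads("""
{
  "db_name": "messages",
  "doc_count": 48293102,
  "disk_size": 139586437120,
  "update_seq": 51230042
}
""")

def disk_size_gb(info):
    """Convert CouchDB's disk_size field (bytes) to gigabytes for a dashboard."""
    return info["disk_size"] / (1024 ** 3)

print(round(disk_size_gb(sample), 1))  # 130.0
```

Graphing this value over time, as the Scout plugin does, is what turns a raw number into a capacity-planning signal.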
Performance-wise, are there any high-level metrics for CouchDB that are more important to watch than others?
In general, I’ve found CouchDB to be very efficient when it comes to memory usage, so that’s generally not a concern. Building views can be CPU intensive, so that is something people may want to keep an eye on, along with read and write throughput. I suppose the rest depends on how you are using CouchDB. Most of what we use it for is archiving existing data once it reaches a certain age. We don’t really run into issues getting that data from MySQL into CouchDB. We rarely see any 3xx or 4xx HTTP responses from CouchDB. I’d imagine this may be a bit higher for applications that use CouchDB as their primary data store. And, as I mentioned above, we pay close attention to the sizes of our databases, to make sure we have plenty of time to make adjustments if we see that we’re starting to run low on disk space.
Any tips for running CouchDB in production for someone coming from a relational database background?
The learning curve, when moving from a relational DB to CouchDB, is steep. In addition to the general NoSQL learning curve (self-contained documents, no foreign keys), CouchDB views take some getting used to. The MapReduce paradigm is so different from the dynamic queries supported by a relational database, or something like MongoDB. There is no database console you can use to run ad-hoc queries against your data, and implementing a view takes quite a bit more time and effort than simply hacking together a SQL statement in your application code. But, it certainly has its strengths when dealing with large amounts of data. Tackling this learning curve was by far the most difficult part about moving to CouchDB. Setting CouchDB up in production was very straightforward, it’s very low maintenance, and we haven’t had many problems to speak of. As far as tips go, forget everything you know about databases, give yourself plenty of time, and learn by taking little baby steps. It really is a completely different beast.
The CouchDB Takeaway
Summarizing Signal’s experience with CouchDB:
- CouchDB views are great for querying aggregate stats on large databases. As data is modified, the indexes are updated, ensuring results stay current. You can also request “stale” results, which return very quickly without waiting for the index to update. CouchDB also excels at replication.
- The learning curve for MapReduce is steep for those coming from a relational DB background. Implementing a view takes quite a bit of effort compared to hacking together an SQL statement.
- CouchDB is easy to set up and is low maintenance. Watch out for CPU usage (building views can be intensive) and disk usage if you have a very large data set with a lot of views.
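On the “stale” point above: a view can be queried with `stale=ok`, which serves whatever index was last built instead of triggering an update, trading freshness for speed. A sketch of building such a query URL follows — the database, design document, and view names are hypothetical, and note that later CouchDB releases favor `update=false` over the `stale` parameter.

```python
import json
from urllib.parse import urlencode

# Hypothetical view URL; stale=ok serves the last-built index without
# waiting for a rebuild, and group_level=2 rolls results up to
# [account, year]. Keys are JSON-encoded, as CouchDB expects.
base = "http://localhost:5984/messages/_design/messages/_view/by_account_month"
params = urlencode({
    "group_level": 2,
    "stale": "ok",
    "startkey": json.dumps(["acct1", "2009"]),
    "endkey": json.dumps(["acct1", "2009", {}]),  # {} sorts after all strings
})
url = f"{base}?{params}"
print(url)
```

Issuing a GET on that URL would return the cached per-account yearly counts — the fast, possibly slightly out-of-date reads the summary alludes to.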