September 24, 2013
Few things feel worse than rolling out a High Availability (HA) system, then regularly watching that system collapse. For our team at Rails Machine, that failing HA system was MySQL Multi-Master Replication Manager (MMM).
What's wrong with MySQL MMM? What's special about Galera Cluster? Read on!
How MySQL MMM works: a server with MySQL MMM installed polls the MySQL nodes every ten seconds (the default interval) to check their health. Only one node holds the writer role at a time - the rest can hold reader roles. MMM maintains a virtual IP that points to the node with the writer role.
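In configuration terms, that looks roughly like this (a simplified mmm_common.conf sketch - the host names and IPs are hypothetical):

```
# Simplified mmm_common.conf sketch (hypothetical hosts and IPs)
<host db1>
    ip      192.168.0.11
    mode    master
    peer    db2
</host>

<host db2>
    ip      192.168.0.12
    mode    master
    peer    db1
</host>

<role writer>
    hosts   db1, db2
    ips     192.168.0.100    # virtual IP that applications write to
    mode    exclusive        # only one node holds the writer role
</role>

<role reader>
    hosts   db1, db2
    ips     192.168.0.101, 192.168.0.102
    mode    balanced
</role>
```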
The problem is the polling: if MySQL MMM only checks every ten seconds, what happens when the writer node fails between checks? If you need an HA setup, you're probably handling a lot of transactions - you could have hundreds of thousands of failed transactions before MMM detects that the writer node isn't healthy (at 10,000 transactions per second, a ten-second blind spot alone is 100,000 failed transactions). Worse, what if there's an internal issue where replication fails first, then transactions fail, and then MMM switches the writer role to the other node? That node won't be consistent with the original writer.
Decreasing the polling interval to one second doesn't fix this - a large database can still push through thousands of transactions in a single second.
So, polling is the fundamental issue - but beyond that, MySQL MMM in the wild often produces hard-to-recover failures. Here's Baron Schwartz, MySQL God at Percona, on MMM's flaws:
In short, MMM causes more downtime than it prevents. It's a Low-Availability tool, not a High-Availability tool. It only takes one really good serious system-wide mess to take you down for a couple of days, working 24-7 trying to scrape your data off the walls and put it back into the server. MMM brings new meaning to the term cluster-f__k.
Despite MMM's flaws, it was the least broken way to do HA for MySQL for a time. But, times have changed. Even the creator of MySQL MMM says it's time for a change. Here's Alexey's comment on Baron's blog post on MMM:
I'm the original author of MMM and I completely agree with you. Every time I try to add HA to my clusters I remember MMM and want to stab myself because I simply could not trust my data to the tool and there is nothing else available on the market to do the job reliably.
Our Galera Cluster setup still uses polling for health checks - how is that any better than MMM?
The answer lies in how replication works. With standard MySQL, writes to the master are recorded in a binary log. Slaves then replay the queries from the binary log, so there is always a delay between when a query runs on the master and when the other nodes run it. This is asynchronous replication.
MySQL's asynchronous replication has two fundamental problems: slaves always lag behind the master, and transactions committed on the master but not yet replayed on a slave are simply lost if the master dies.
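You can see the first problem directly on any slave (a quick check, assuming a standard master/slave pair):

```sql
-- On a slave: how far behind the master are we?
-- Seconds_Behind_Master > 0 means readers are seeing stale data;
-- NULL means replication is broken entirely.
SHOW SLAVE STATUS\G
```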
With Galera, a transaction's write-set is replicated to every node and certified there before the commit succeeds. If a node fails to apply a transaction, it is immediately removed from the cluster. In other words, Galera replication is synchronous: you never lose a transaction and there is no lag (and Galera's row-based replication is around 5x faster).
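Turning this on is mostly a matter of MySQL configuration. A minimal sketch of the wsrep settings involved (the provider path, cluster name, and addresses here are hypothetical):

```
# my.cnf - minimal Galera sketch (hypothetical names and addresses)
[mysqld]
binlog_format            = ROW      # Galera requires row-based replication
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2

wsrep_provider        = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name    = example_cluster
wsrep_cluster_address = gcomm://10.0.0.1,10.0.0.2,10.0.0.3
wsrep_sst_method      = rsync       # how failed/new nodes get a full copy of the data
```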
MySQL MMM is an outsider - it's "dumb" about what's actually going on inside the server. It performs one kind of check, and that's all it knows how to react to.
Galera Cluster is an "insider" and can therefore be smarter about the internal state of each node, doing the right thing without manual intervention - a node syncing or desyncing, becoming a donor, and so on, all happens automatically.
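That internal state is exposed through wsrep status variables, so any node can tell you exactly where it stands - for example:

```sql
-- Ask a node about its own cluster state
SHOW STATUS LIKE 'wsrep_cluster_size';         -- how many nodes are in the cluster
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- Synced, Donor/Desynced, Joining...
SHOW STATUS LIKE 'wsrep_ready';                -- ON when the node can accept queries
```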
While it's possible to write to any node in a Galera Cluster, we've chosen to minimize potential deadlocks by writing to a single node. To do this, we use HAProxy: one frontend for the writer node, and one frontend for reads that balances queries across all of the nodes. The writer frontend sends queries to a single node; the other nodes are configured as backups.
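Here's a sketch of what that looks like in haproxy.cfg (the addresses, ports, and check user are hypothetical):

```
# haproxy.cfg sketch (hypothetical addresses and check user)
listen mysql-writer
    bind 127.0.0.1:3306
    mode tcp
    option mysql-check user haproxy_check
    server db1 10.0.0.1:3306 check
    server db2 10.0.0.2:3306 check backup   # only used if db1 goes down
    server db3 10.0.0.3:3306 check backup

listen mysql-reader
    bind 127.0.0.1:3307
    mode tcp
    balance roundrobin
    option mysql-check user haproxy_check
    server db1 10.0.0.1:3306 check
    server db2 10.0.0.2:3306 check
    server db3 10.0.0.3:3306 check
```

The backup keyword is what gives us a single writer: HAProxy only routes traffic to a backup server when every non-backup server is down.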
If HAProxy detects that the writer node isn't healthy, it immediately promotes one of the backup nodes to writer. MySQL MMM frequently ended up taking all of the nodes offline in this situation - HAProxy doesn't. We may lose a couple of queries while the writer backend is switched, but it won't result in an inconsistent data set across servers, which is far more crippling.
We don't automatically repair failed nodes, but that's OK. Our primary concern is making sure a healthy node is taking writes, and HAProxy does that.
When a node fails with standard MySQL replication, you put an intense load on one server while getting replication set up again (that single server handles not only reads and writes, but also the load from innobackupex).
With Galera, you take one of the nodes offline (so you need a minimum of three nodes). That node becomes the donor: write operations to it are blocked while it transfers its data to the failed node (or a new node) via rsync. Then, both the donor and the recovering node catch up by applying queries from their slave queues.
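You can watch the whole recovery using the same wsrep status variables:

```sql
-- During a state transfer:
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- donor reports 'Donor/Desynced';
                                               -- the recovering node moves through
                                               -- 'Joining' and 'Joined' to 'Synced'
SHOW STATUS LIKE 'wsrep_local_recv_queue';     -- writesets still queued to apply
```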
Once both nodes are back to a Synced state, HAProxy automatically marks them as Up and adds them back to the frontend.
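For HAProxy to distinguish Synced from Donor, a plain MySQL connection check isn't enough. A common way to wire this up (the usual community pattern, not a detail from our setup) is a small HTTP health check on each node - a clustercheck-style script that returns 200 only when the node is synced - combined with HAProxy's httpchk:

```
# Variation on the writer frontend: the health check reflects Galera state
# (assumes a clustercheck-style HTTP service on port 9200 of each node)
listen mysql-writer
    bind 127.0.0.1:3306
    mode tcp
    option httpchk GET /
    server db1 10.0.0.1:3306 check port 9200
    server db2 10.0.0.2:3306 check port 9200 backup
    server db3 10.0.0.3:3306 check port 9200 backup
```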
Our reasons for switching to MariaDB were a mix of technical and political.
Would we make the same move again? Absolutely. We don't see any reason not to switch to both MariaDB and Galera Cluster.
The great thing about monitoring Galera is that it's layered - we can easily monitor each part of the stack. We use Scout for monitoring, which means we just need a plugin for each layer: MySQL itself, Galera's wsrep status, and HAProxy.
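The HAProxy layer is the easiest to expose: its built-in stats page shows each backend's Up/Down state (a minimal sketch - the port and URI are arbitrary):

```
# haproxy.cfg: built-in stats page showing backend up/down state
listen stats
    bind 127.0.0.1:8080
    mode http
    stats enable
    stats uri /haproxy
```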
Until recently, MySQL MMM was the best (but broken) way to add High Availability to MySQL. Galera Cluster finally brings true High Availability to MySQL, thanks primarily to synchronous replication.