Percona XtraDB Cluster overview

Abstract

This document describes the concepts, the architecture, the strengths, and the limitations of Percona XtraDB Cluster.

Introductory concepts

Server-centric

Usually, in the application layer, if you need more performance you can just add more resources, but what about the databases? There you have to consider that:

1) you have to distribute all changes to all the servers in real time (the major point)

2) the database has to be available to all the applications

3) the applications have to be able to make changes

 

In the common server-centric topology, one server streams the data to another one, but this isn’t the best method to protect your data.

You can create a really complex topology, like the one in this image, but you will always have a single point of failure: the master node.

The replication and the communication between the master and the slave nodes are asynchronous, so the master does not care about the slave status (transactions, replication delay, and so on). If the master crashes, transactions it has committed might not have been transmitted to any slave, so a failover might cause data loss.
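
For example, with classic asynchronous replication the master never waits for the slaves; all you can do is observe the lag from the slave side. A quick hedged example (field names vary slightly between MySQL versions, which also offer SHOW REPLICA STATUS):

    -- run on a slave: the master has no idea of these values
    SHOW SLAVE STATUS\G
    -- key fields to watch:
    --   Seconds_Behind_Master                  (replication delay)
    --   Slave_IO_Running / Slave_SQL_Running   (replication threads alive?)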

Data-centric

Another solution to manage a database cluster is the data-centric method.

Data-centric refers to an architecture where data is the primary and permanent asset, and applications come and go.

Data is synchronised between servers.

The data-centric method ("virtually synchronous" replication) guarantees that:

  • if changes happen on one node of the cluster, they happen on the other nodes "synchronously"
  • it is always HA (no data loss)
  • data replicas are always consistent
  • transactions can be executed on all nodes in parallel

How virtually synchronous replication works

On node 1 the user starts a transaction and runs some queries; when the user issues the COMMIT, the cluster sends the write-set (as a network event notification) to the other nodes, and the nodes reply "I got the write-set" by sending back an acknowledgement. At this point, as you can see in the image, the certification process begins, and only when the certification has passed does node 1 execute the physical COMMIT; when this process is finished, node 1 returns the feedback "the commit is done" to the client.

Now pay attention to node 2: it has the write-set, it certifies it as well, and it starts to apply the transaction; so, as you can see, there can be a small window of time in which the nodes are out of sync.

As you can understand, the critical point here is the means of communication between the nodes, the network: you need a very good quality, low-latency network.
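
If that small out-of-sync window matters for a particular read, Galera lets a session request a causality check before reading. A minimal sketch (the table and row are just examples):

    -- make this session wait until the node has applied every write-set
    -- it has already received before executing the SELECT
    SET SESSION wsrep_sync_wait = 1;
    SELECT balance FROM accounts WHERE id = 42;   -- hypothetical table/row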

Certification

In certification-based replication, each transaction has a global ordinal sequence number (the order in which transactions are executed), and this procedure determines whether or not the node can apply the write-set.

When a node receives the COMMIT from the client, it collects all changes into a write-set and sends this set to all the nodes. The write-set undergoes a deterministic certification test; during this test, all the nodes have to determine whether they are able to replay this write-set or whether something blocks it, like another transaction already in their queue; to do this, the certification test uses primary keys (so it is very important to define them!).

This procedure is deterministic and all the nodes receive transactions in the same order. So all nodes reach the same decision about the outcome of the transaction. The node that started the transaction, as you can see in the previous picture, can then notify the client application whether or not it has committed the transaction.
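
You can actually see that global ordering on any node: each node exposes the sequence number of the last write-set it has committed (a small illustration; the value in the comment is made up):

    SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';
    -- e.g. wsrep_last_committed = 123456 (same ordering on every node)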

During this procedure you can have two types of errors:

  • brute force abort: when another node executes a conflicting transaction and the local active transaction needs to be killed;
  • local certification error: when two nodes execute conflicting workloads and add them to the queue at the same time.

When you have only one server, you use the traditional locking model: if you try to write the same data in two transactions at the same time, transaction 2 must wait for transaction 1, and if transaction 1 does not commit, transaction 2 times out.

However, if you work in a cluster you have a different behavior, called optimistic locking. As you can see in the bottom part of the picture, transaction 1 starts on server 1 and transaction 2 starts on server 2; on server 1 you can update a row and commit it, while doing the same steps on server 2. At this point the certification process comes into play: it sees that another transaction is coming from server 1, checks its queue, and determines that the two transactions are in conflict. The result is that transaction 1 completes while transaction 2 does not.
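
From the application's point of view the losing transaction does not wait: its COMMIT fails with a deadlock error and it has to be retried. A sketch of the conflict (table and values are made up):

    -- on server 1
    BEGIN;
    UPDATE accounts SET balance = 90 WHERE id = 1;
    COMMIT;                                   -- wins certification

    -- on server 2, at (almost) the same time
    BEGIN;
    UPDATE accounts SET balance = 80 WHERE id = 1;
    COMMIT;
    -- loses certification; the client typically sees:
    -- ERROR 1213 (40001): Deadlock found when trying to get lock;
    -- try restarting transaction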

Percona XtraDB Cluster

Percona XtraDB Cluster is an active/active high availability and high scalability open source solution for MySQL® clustering.

It integrates Percona Server and Percona XtraBackup with the Codership Galera library of MySQL high availability solutions in a single package that enables you to create a cost-effective MySQL high availability cluster.

Main features

Synchronous replication: Data is written to all nodes simultaneously, or not written at all if it fails even on a single node.

Multi-master replication: Any node can trigger a data update.

True parallel replication: Multiple threads on the replica perform replication at the row level.

Automatic node provisioning: You simply add a node and it automatically syncs.

Data consistency: No more unsynchronized nodes.

PXC Strict Mode: Avoids the use of experimental and unsupported features.

Configuration script for ProxySQL: Percona provides a ProxySQL package with the proxysql-admin tool that automatically configures Percona XtraDB Cluster nodes.

Automatic configuration of SSL encryption:  Percona XtraDB Cluster includes the pxc-encrypt-cluster-traffic variable that enables automatic configuration of SSL encryption.

Optimized Performance: Percona XtraDB Cluster performance is optimized to scale with a growing production workload.

Data compatibility: You can use data created by any MySQL variant.

Application compatibility: There are no or minimal application changes required.

Percona XtraDB Cluster relies on:

  • the Galera replication plugin, which provides the write-set replication functionality;
  • the group communication plugin (e.g. gcomm), which handles the communication between the nodes and maintains the order of transaction execution;
  • the wsrep API, which is the interface between the Galera replication plugin and the database server.

Percona XtraDB Cluster is based on Percona Server running with the XtraDB storage engine (Percona's enhanced version of the InnoDB storage engine). It uses the Galera library, which is an implementation of the write-set replication API.

The group communication layer manages transactions and their ordering: when node 1 commits a transaction, gcomm sends this information (the write-set) to all the nodes.

If there are concurrent transactions on multiple nodes, gcomm defines an order that guarantees all messages are read in the same order on all nodes.
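
As an illustration, the Galera-related part of a node's my.cnf usually looks like the sketch below (paths, IPs and names are placeholders; check the Percona documentation for the exact options of your version):

    [mysqld]
    # write-set replication provider (the Galera library);
    # the path depends on the distribution and PXC version
    wsrep_provider        = /usr/lib/galera4/libgalera_smm.so
    # all the nodes of the cluster
    wsrep_cluster_address = gcomm://10.0.0.1,10.0.0.2,10.0.0.3
    wsrep_cluster_name    = example-cluster
    wsrep_node_name       = node1
    wsrep_node_address    = 10.0.0.1
    # requirements for Galera replication
    binlog_format            = ROW
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2
    # state transfer method (see the State Transfer section)
    wsrep_sst_method = xtrabackup-v2
    # reject experimental and unsupported features (PXC Strict Mode)
    pxc_strict_mode = ENFORCING
    # automatic SSL configuration of cluster traffic
    # (enabled by default in recent versions)
    # pxc-encrypt-cluster-traffic = ON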

Flow control

This is a great replication feedback mechanism offered by Galera.

This feedback allows any node in the cluster to instruct the group when it needs replication to pause and when it is ready for replication to continue. This prevents any node in the synchronous replication group from getting too far behind the others in applying replication.
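
You can observe flow control on each node through the wsrep status counters; a constantly growing wsrep_flow_control_sent usually points at the slow node. A quick example:

    -- fraction of time replication has been paused by flow control
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
    -- pause messages sent/received by this node
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent';
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_recv';
    -- write-sets received but not yet applied on this node
    SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';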

Limitations

  • only InnoDB tables are supported
  • it uses optimistic locking instead of traditional locking, so if you are writing on multiple nodes at the same time you can have conflicts
  • the weakest node limits write performance (if you have a weak node that cannot apply transactions as fast as the others, that node sends a lot of flow control messages)
  • all tables should have a primary key (a query to spot the ones missing it is sketched at the end of this section)
  • large transactions are not recommended

LOCK TABLES and GET_LOCK() are not supported (because when you lock a table, the lock only applies locally on your node).
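
Since every table should have a primary key, a query like the following helps find the InnoDB tables that are missing one (a generic information_schema sketch, not PXC-specific):

    SELECT t.table_schema, t.table_name
    FROM information_schema.tables AS t
    LEFT JOIN information_schema.table_constraints AS c
           ON  c.table_schema    = t.table_schema
           AND c.table_name      = t.table_name
           AND c.constraint_type = 'PRIMARY KEY'
    WHERE t.table_type = 'BASE TABLE'
      AND t.engine = 'InnoDB'
      AND t.table_schema NOT IN
          ('mysql', 'information_schema', 'performance_schema', 'sys')
      AND c.constraint_name IS NULL;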

State Transfer

There are two kinds of state transfer:

  • the full SST (State Snapshot Transfer): used for new nodes or for nodes that have been disconnected for a long time
  • the incremental IST (Incremental State Transfer): used for nodes that have been disconnected for a short time

The SST can be performed with:

 

  • mysqldump
  • rsync
  • XtraBackup

 

The first two block the donor during the copy, while with XtraBackup the donor is always available (so the latter is the recommended method).
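
The SST method is selected with the wsrep_sst_method variable in my.cnf; xtrabackup-v2 is the value for the XtraBackup-based, non-blocking transfer. A minimal example:

    [mysqld]
    # non-blocking SST through Percona XtraBackup (recommended)
    wsrep_sst_method = xtrabackup-v2
    # alternatives: rsync, mysqldump (both block the donor during the copy)

You can check what a running node uses with SHOW VARIABLES LIKE 'wsrep_sst_method';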

Load Balancing

In order to have load balancing between the nodes, you can use a proxy like HAProxy or ProxySQL. We use and recommend the latter.

ProxySQL (a Layer 7 proxy) has an advanced multi-core architecture. It’s built from the ground up to support hundreds of thousands of concurrent connections, multiplexed to potentially hundreds of backend servers.

The main features are:

  • Query caching
  • Query Routing
  • Advanced configuration with 0 downtime
  • Application layer proxy
  • Advanced topology support
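
As a rough idea of what the proxysql-admin tool sets up for you, the PXC nodes end up registered as backends in a ProxySQL hostgroup, via the ProxySQL admin interface (a hedged sketch; IP addresses, port and the hostgroup id are placeholders):

    -- on the ProxySQL admin interface (default port 6032)
    INSERT INTO mysql_servers (hostgroup_id, hostname, port)
    VALUES (10, '10.0.0.1', 3306),
           (10, '10.0.0.2', 3306),
           (10, '10.0.0.3', 3306);
    LOAD MYSQL SERVERS TO RUNTIME;   -- apply the change immediately
    SAVE MYSQL SERVERS TO DISK;      -- persist it across restarts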

Quorum

In a cluster solution, every node has a point, a vote; these points are very important when problems occur, to understand whether the cluster is still consistent and which nodes are broken.

In order to have split-brain protection you need at least three nodes, because if one of the nodes has a problem, the other two nodes together hold two votes (more than half of three), while the broken node has only one; the quorum system puts this broken node in a non-primary state, so the node does not accept any reads or writes.

It is preferable to have an odd number of nodes, in order to prevent a tie in which all the nodes go into a non-primary state.
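
On any node you can check the size and state of the cluster through the wsrep status variables (a short example; the values in the comments are what a healthy three-node cluster reports):

    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- 3
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- Primary
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- Synced
    -- a node expelled by the quorum reports a non-Primary cluster status
    -- and stops accepting reads and writes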

Conclusions

Not recommended for:

  • write-scalable solutions
  • large transactions
  • working with foreign keys
  • sharding

Highly recommended for:

  • easily scaling reads
  • data consistency
  • easy failover
  • no data loss if a server crashes
  • easily adding/removing a node

Linkography

Thanks to the Percona community and Percona's tutorial site.
