From e2c7b1305ca8495065dcf40fd2092d7c698dd6ea Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Tue, 20 Mar 2018 15:47:51 +0100
Subject: Local volumes and StatefulSet to provision Master/Slave MySQL and Galera cluster

---
 docs/databases.txt | 62 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 49 insertions(+), 13 deletions(-)

diff --git a/docs/databases.txt b/docs/databases.txt
index 254674e..331313b 100644
--- a/docs/databases.txt
+++ b/docs/databases.txt
@@ -7,8 +7,9 @@
 Gluster MyISAM (no logs) 1 MB/s unusable 150% 600-800% Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
 Gluster/Block MyISAM (no logs) 5 MB/s slow, but OK 200% ~ 50% No problems on reboot, but requires manual work if node crashes to detach volume.
 Galera INNODB 3.5 MB/s fast 3 x 200% - Should be perfect, but I am not sure about automatic recovery...
- MySQL Slaves INNODB 6-8 exp. fast Available data is HA, but caching is not. We can easily turn the slave to master.
- DRBD MyISAM (no logs) 4-6 exp. ? I expect it as an faster option, but does not fit complete concept.
+ Galera/Hostnet INNODB 4.6 MB/s fast 3 x 200% -
+ MySQL Slaves INNODB 5-8 MB/s fast 2 x 250% - Available data is HA, but caching is not. We can easily turn the slave to master.
+ DRBD MyISAM (no logs) 4-6 exp. ? I expect it as an faster option, but does not fit the OpenShift concept that well.
 Gluster is a way too slow for anything. If node crashes, MyISAM tables may be left in corrupted state. The recovery will take ages to complete.
@@ -29,9 +30,13 @@ So, there is no realy a full HA capable solution at the moment. The most reasona
 (i.e. status displays), current data is available. And we can easily switch the master if necessary.
 The other reasonable options have some problems at the moment and can't be used.
- - Galera. Is a fine solution, but would need some degree of initial maintenance to work stabily. Furthermore, the caching is quite slow. And the
- resync is a big issue.
- - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
+ - Galera. Is a fine solution. The caching is still quite slow. If networking problem is solved (see performance section in network.txt) or host
+ networking is used, it more-or-less on pair with Gluster/Block, but provides much better service to the data reading clients. However, extra
+ investigations are required to understand robustness of crash recovery. In some cases, after a crash Galera was performing a full resync of all
+ data (but I was re-creating statefulset which is not recommended practice, not sure if it happens if the software maintained properly). Also, at
+ some point one of the nodes was not able to join back (even after re-initializing from scratch), but again this hopefully not happening if the
+ service is not pereodically recreated.
+ - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
 MyISAM recovery is quite slow.
 - HostMount will be using our 3-node storage optimally. But if something crashes there is 1 week to recache the data.
@@ -80,16 +85,21 @@ Galera
 * If all nodes crashed, then again one node should restart the cluster and others join later. For
 older versions, it is necessary to run mysqld with '--wsrep-new-cluster'. The new tries to
 automatize it and will recover automatically if 'safe_to_bootstrap' = 1
- in 'grstate.dat' in mysql data folder. It should be set by Galera based on some heuristic,
- but in fact I always had to set it manually. IMIMPORTANT, it should be set only on one of
- the nodes.
-
- - Synchrinization only works for INNODB tables. Furthermore, binary logging should be turned
- on (yes, it is possible to turn it off and there is no complains, but only the table names are
- synchronized, no data is pushed between the nodes).
+ in 'grstate.dat' in mysql data folder. If cluster was shat down orderly, the Galera will
+ set it automatically on the last node to stop the service. In case of a crash, however,
+ it has to be configured manually on the most up to date node. IMIMPORTANT, it should be
+ set only on one of the nodes. Otherwise, the cluster will get nearly unrecoverable.
+ * So, to recover failed cluster (unless automatic recovery works) we must revert to manual
+ procedure now. There is 'gmanager' pod which can be scalled to 3 nodes. We recover a full
+ cluster in this pods in required order. Then, we stop first node and init a statefulSet.
+ As first node in the statefulSet is ready, we stop second node in 'gmanager' and so on.
+
+ - IMPORTANT: Synchrinization only works for INNODB tables. Furthermore, binary logging should
+ be turned on (yes, it is possible to turn it off and there is no complains, but only the table
+ names are synchronized, no data is pushed between the nodes).
 - OpenShift uses 'StatefulSet' to perform such initialization. Particularly, it starts first
- node and waits until it is running before starting next one.
+ node and waits until it is running (and ready) before starting next one.
 * Now the nodes need to talk between each other. The 'headless' service is used for that. Unlinke
 standard service, the DNS does not load balance service pods, but returns IPs of all service
 members if appropriate DNS request is send (SRV). In Service spec we specify.
@@ -112,7 +122,33 @@ Galera
 serviceName: adei-ss
 There are few other minor differences. For instance, the 'selector' have more flexible notation and should
 include 'matchLabels' before specifying the 'pod' selector, etc.
+
+ - IMPORTANT: If we use hostPath (or even hostPath based pv/pvc pair), the pods will be assigned
+ to the nodes randomly. This is not ideal if we want to shutdown and restart cluster. In general,
+ we always want the first pod to end-up on the same storage as it will be likely the one able to
+ boostrap. Instead, we should use 'local' volume feature (alpha in OpenShift 3.7 and should be
+ enabled in origin-node and origin-master configurations). Then, openshift 'pvc' to specific node
+ and the 'pod' executed on the node where its 'pvc' is bounded.
+
+ - IMPORTANT: StatefulSet ensures ordering and local volume data binding. Consequently, we should
+ not destroy StatefulSet object which save the state information. Otherwise, the node assignments
+ will chnage and cluster would be hard to impossible to recover.
+
+ - Another problem of our setup is slow internal network (since bridging over Infiniband is not
+ possible). One solution to overcome this is to run Galera using 'hostNetwork'. Then, however,
+ the 'peer-finder' is failing. It tries to match the service names to its 'hostname' expecting
+ that it will be in the form of 'galera-0.galera.adei.svc.cluster.local', but with host networking
+ enabled the actual hostname is used (i.e. ipekatrin1.ipe.kit.edu). I have to patch peer-finder
+ to resolve IPs and try to match the IPs.
 - To check current status of the cluster
 SHOW STATUS LIKE 'wsrep_cluster_size';
+
+Master/Slave replication
+========================
+ - This configuration seems more robuts, but strangely has a lot of performance issues on the
+ slave side. Network is not a problem, it is able to get logs from the master, but it is significantly
+ slower in applying it. The main performance killer is disk sync operations triggered by 'sync_binlog',
+ INNODB log flashing, etc. Disabling it allows to bring performance on reasonable level. Still,
+ the master is caching at about 6-8 MB/s and slave at 4-5 MB/s only.
\ No newline at end of file
-- 
cgit v1.2.3
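Note on the manual Galera recovery described in the patch: 'safe_to_bootstrap' has to be set on the most up to date node, and only on one of them. A minimal Python sketch of that selection is given here. It is not part of the commit. The helper names (parse_grstate, pick_bootstrap_node, mark_safe_to_bootstrap) are hypothetical, and it assumes that the 'grstate.dat' of every node has already been read into a string and that it holds 'seqno:' and 'safe_to_bootstrap:' lines. After a crash the recorded seqno is often -1; in that case the sketch refuses to choose, and the positions have to be recovered by other means first.

```python
import re


def parse_grstate(text):
    """Parse the 'key: value' lines of a grstate.dat file, skipping comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states):
    """Return the node with the highest recorded seqno.

    states maps a node name to the text of its grstate.dat.
    Ties are broken by node name so that exactly one node is returned.
    """
    seqnos = {}
    for node, text in states.items():
        # Assumption: a missing seqno is treated like the unknown value -1.
        seqnos[node] = int(parse_grstate(text).get("seqno", "-1"))
    best = max(seqnos.values())
    if best < 0:
        raise ValueError("no node has a recorded seqno; recover positions first")
    return min(node for node, seqno in seqnos.items() if seqno == best)


def mark_safe_to_bootstrap(text, safe):
    """Return the grstate.dat text with safe_to_bootstrap set to 1 or 0."""
    flag = "1" if safe else "0"
    new_text, count = re.subn(
        r"(?m)^safe_to_bootstrap:.*$", "safe_to_bootstrap: " + flag, text
    )
    if count == 0:
        # Assumption: append the line if the file does not contain it yet.
        new_text = text.rstrip("\n") + "\nsafe_to_bootstrap: " + flag + "\n"
    return new_text
```

Usage would be along the lines of: node = pick_bootstrap_node({"galera-0": text0, "galera-1": text1, "galera-2": text2}), then write mark_safe_to_bootstrap(text, name == node) back on each node, so the flag ends up as 1 on one node and 0 on the others.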