- Storage for HA databases is problematic. There are several ways to organize storage. The major
characteristics are listed here (INNODB is generally faster, but takes about 20% more disk space. Initially it is
significantly faster and takes 5x the disk space, but it normalizes...)
Method          Database          Performance  Clnt/Cache    MySQL      Gluster   HA
HostMount       MyISAM/INNODB     8 MB/s       fast          250%       -         Nope. But otherwise least problems to run.
Gluster         MyISAM (no logs)  1 MB/s       unusable      150%       600-800%  Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
Gluster/Block   MyISAM (no logs)  5 MB/s       slow, but OK  200%       ~ 50%     No problems on reboot, but requires manual work to detach the volume if a node crashes.
Galera          INNODB            3.5 MB/s     fast          3 x 200%   -         Should be perfect, but I am not sure about automatic recovery...
Galera/Hostnet  INNODB            4.6 MB/s     fast          3 x 200%   -
MySQL Slaves    INNODB            5-8 MB/s     fast          2 x 250%   -         Available data is HA, but caching is not. We can easily turn the slave into a master.
DRBD            MyISAM (no logs)  4-6 exp.     ?                                  I expect it to be a faster option, but it does not fit the OpenShift concept that well.
Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take ages to complete.
Gluster/Block is faster, but HA suffers. The volume stays attached to the pod running on the crashed node. It does not seem to be detached automatically until
the failed pod (in 'Unknown' state) is killed with
oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
Then, after some delay, it is re-attached to the new running pod. Technically, we can run a kind of monitoring service which will detect such nodes
and restart the affected pods. Still, this solution is limited to MyISAM with binary logging disabled. Unlike the simple Gluster solution, the clients may use the system
while caching is going on, but it is quite slow. The main trouble is MyISAM corruption, since the recovery is slow.
Galera is slower than Gluster/Block, but is fully available. The clients also have more servers to query data from. The cluster start-up is a bit
tricky and I am not sure that everything will work smoothly now. Some tuning may be necessary. Furthermore, it seems that if the cluster has crashed, we
can recover from one of the nodes, but all the data will be destroyed on the other members and they will pull the complete dataset. The synchronization
is faster than caching (~ 140 MB/s), but it will still take about 10 hours to synchronize 5 TB of KATRIN data.
So, there is not really a fully HA-capable solution at the moment. The most reasonable option seems to be compromising on caching HA.
- MySQL with slaves. The asynchronous replication should be significantly faster than Galera. The passthrough to source databases will keep working
(i.e. status displays), and current data is available. We can also easily switch the master if necessary.
The other reasonable options have some problems at the moment and can't be used.
- Galera is a fine solution, but caching is still quite slow. If the networking problem is solved (see the performance section in network.txt) or host
networking is used, it is more or less on par with Gluster/Block, but provides much better service to the data-reading clients. However, extra
investigation is required to understand the robustness of crash recovery. In some cases, after a crash Galera performed a full resync of all
data (but I was re-creating the statefulset, which is not recommended practice; I am not sure whether it happens if the software is maintained properly). Also, at
some point one of the nodes was not able to join back (even after re-initializing from scratch), but again this hopefully does not happen if the
service is not periodically recreated.
- Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
MyISAM recovery is quite slow.
- HostMount will use our 3-node storage optimally. But if something crashes, it takes about 1 week to recache the data.
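
The time estimates quoted above follow from simple throughput arithmetic. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and assuming that the 5 TB KATRIN figure and the 8 MB/s HostMount caching rate from the table also apply to a full recache (the function name is made up for illustration):

```python
def transfer_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_s megabytes per second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1e6 MB
    return size_mb / rate_mb_s / 3600


if __name__ == "__main__":
    # Galera state transfer at ~140 MB/s
    print(f"resync:  {transfer_hours(5, 140):.1f} hours")
    # Recaching from the sources at ~8 MB/s (assumed rate, see above)
    print(f"recache: {transfer_hours(5, 8) / 24:.1f} days")
```

Under these assumptions the resync comes out just under 10 hours and the recache at roughly a week, which matches the figures in the notes.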
Gluster/Block
=============
The idea is pretty simple. A standard gluster file system is used to store 'block' files (just normal files). These files are used as block devices
with a single-pod access policy. The GFApi interface is used to access the data on Gluster (avoiding context switches) and is exposed over iSCSI to the clients.
There are a couple of problems with configuration and run-time.
- The default Gluster containers will complain about rpcbind. We are using host networking in this case and the required port (111) conflicts between the container
and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should be removed from the Gluster container
and the requirement removed from the gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it
should be). We additionally need 'iscsi-initiator-utils' in the container.
- Only a single pod should have access to the block device. Consequently, when the volume is attached to the client, other pods can't use it any more.
The problem starts if the node running the pod dies. This is not handled perfectly by OpenShift now. The volume remains attached to the pod in the 'Unknown' state
until it is manually killed. Only then, after another delay, is it detached and available for the replacement pod (which will be stuck in the ContainerCreating
phase until then). Pods in the 'Unknown' state are not easy to kill:
oc delete --force --grace-period=0 pod/mysql-1-m4wcq
- Heketi is buggy.
* If something goes wrong, it starts to create multitudes of Gluster volumes and finally crashes with a broken database. It is possible to remove the
volumes and recover the database from backup, but it is time consuming and unreliable for an HA solution.
* In particular, this happens if we try to allocate more disk space than is available. OpenShift configures the size of the Gluster file system used
to back block devices. It is 100 GB by default. If we specify 500Gi in the pvc, it will try to create 15 such devices (another maximum configured by
OpenShift) before crashing.
* Overall, I'd rather use manual provisioning only.
- Even without heketi it is still problematic (maybe it is better with the official RH container running on GlusterFS 3.7, but I did not check...). We
can try again with GlusterFS 4.1. There are probably multiple problems, among them:
* GlusterFS may fail on one of the nodes (while still showing it as up and running). If any of the block services has problems communicating with the local gluster
daemon, most requests to the gluster daemon will time out (info/list will still work, but slowly).
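
The manual clean-up of pods stuck in the 'Unknown' state could be automated by the kind of monitoring service mentioned earlier. A minimal sketch, assuming the `oc` client is on the path and that `oc get pods --no-headers` prints the usual NAME / READY / STATUS / RESTARTS / AGE columns; the function names are made up for illustration, and the sweep defaults to a dry run that only prints the commands:

```python
import subprocess


def unknown_pods(listing: str) -> list[str]:
    """Return names of pods whose STATUS column reads 'Unknown'.

    `listing` is the text printed by `oc get pods --no-headers`
    (assumed columns: NAME READY STATUS RESTARTS AGE).
    """
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "Unknown":
            names.append(fields[0])
    return names


def delete_command(namespace: str, pod: str) -> list[str]:
    """Build the same force-delete command that is used manually in these notes."""
    return ["oc", "-n", namespace, "delete", "--force", "--grace-period=0", "pod", pod]


def sweep(namespace: str = "adei", dry_run: bool = True) -> None:
    """List pods in the namespace and force-delete those in 'Unknown' state."""
    listing = subprocess.run(
        ["oc", "-n", namespace, "get", "pods", "--no-headers"],
        check=True, capture_output=True, text=True,
    ).stdout
    for pod in unknown_pods(listing):
        cmd = delete_command(namespace, pod)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `sweep()` periodically (e.g. from cron) would only shorten the first delay; the second delay before the volume is re-attached remains.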
Galera
======
- To bring a new cluster up, there are several steps.
* All members need to initialize standard standalone databases.
* One node should perform the initialization, and the other nodes join after it is completed.
* The nodes will delete their mysql folders and re-synchronize from the first node.
* Then, the cluster will be up and all nodes will be in the so-called primary state.
- The procedure is similar for crash recovery:
* If a node leaves the cluster, it may just come back and be re-synchronized from the other
cluster members if there is a quorum. For this reason, it is necessary to keep at least
3 nodes running.
* If all nodes crashed, then again one node should restart the cluster and the others join
later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'.
Newer versions try to automate this and will recover automatically if 'safe_to_bootstrap' = 1
in 'grastate.dat' in the mysql data folder. If the cluster was shut down in an orderly way, Galera will
set it automatically on the last node to stop the service. In case of a crash, however,
it has to be configured manually on the most up-to-date node. IMPORTANT: it should be
set only on one of the nodes. Otherwise, the cluster will become nearly unrecoverable.
* So, to recover a failed cluster (unless automatic recovery works) we must revert to the manual
procedure for now. There is a 'gmanager' pod which can be scaled to 3 nodes. We recover the full
cluster in these pods in the required order. Then, we stop the first node and init a StatefulSet.
As soon as the first node in the StatefulSet is ready, we stop the second node in 'gmanager' and so on.
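
Choosing the "most up-to-date node" for the manual bootstrap can be sketched as follows. This is only an illustration, assuming that the contents of each node's 'grastate.dat' have been collected as text and that the file consists of "key: value" lines including 'seqno'. After an unclean shutdown the stored seqno is often -1; in that case the position has to be recovered first (e.g. with 'mysqld --wsrep-recover'), which this sketch does not do. The function names and node names are hypothetical.

```python
def parse_grastate(text: str) -> dict[str, str]:
    """Parse 'key: value' lines of a grastate.dat file, skipping comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states: dict[str, str]) -> str:
    """Return the node with the highest seqno.

    `states` maps a node name to the text of its grastate.dat.
    Raises ValueError if no node has a usable (non-negative) seqno.
    """
    seqnos = {
        node: int(parse_grastate(text).get("seqno", "-1"))
        for node, text in states.items()
    }
    best = max(seqnos, key=seqnos.get)
    if seqnos[best] < 0:
        raise ValueError("no node has a valid seqno; recover positions first")
    return best
```

Only the node returned here should get 'safe_to_bootstrap' set to 1; all others must keep 0.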
- IMPORTANT: Synchronization only works for INNODB tables. Furthermore, binary logging should
be turned on (yes, it is possible to turn it off and there are no complaints, but then only the table
names are synchronized and no data is pushed between the nodes).
- OpenShift uses a 'StatefulSet' to perform such initialization. In particular, it starts the first
node and waits until it is running (and ready) before starting the next one.
* Now the nodes need to talk to each other. A 'headless' service is used for that.
Unlike a standard service, the DNS does not load-balance between service pods, but returns the IPs of
all service members if an appropriate DNS request (SRV) is sent. In the Service spec we specify:
clusterIP: None - old version
For clients we still need a load-balancing service. So, we need to add a second service
to serve their needs.
* To decide if it should perform cluster initialization, the node tries to resolve the members
of the service. If it is alone, it initializes the cluster. Otherwise, it tries to join the other
members already registered in the service. The problem is that by default OpenShift will only
add a member when it is ready (readiness check). Consequently, all nodes will try to
initialize. There are two methods to prevent this. One works up to 3.7 and the other from 3.8 on,
but there is no harm in using both for now.
The new one is to set in the Service spec:
publishNotReadyAddresses: True
The old one is to specify in the Service metadata.annotations:
service.alpha.kubernetes.io/tolerate-unready-endpoints: true
* Still, we should check for peers quickly, before the other pods have had a chance to start.
* Furthermore, there are some differences from the 'dc' definition. We need to specify 'serviceName'
in the StatefulSet spec.
serviceName: adei-ss
There are a few other minor differences. For instance, the 'selector' has a more flexible notation
and should include 'matchLabels' before specifying the 'pod' selector, etc.
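
The two services described above (a headless one for peer discovery and a normal one for clients) can be sketched as manifests generated from Python and dumped as JSON, which `oc create -f` accepts alongside YAML. This is a minimal sketch: the service names, the 'app: galera' selector and the single named port 3306 are assumptions for illustration, not the actual configuration.

```python
import json


def headless_service(name: str, selector: dict, port: int = 3306) -> dict:
    """Headless service for peer discovery; publishes pods before they are ready."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            # old method; annotation values have to be strings
            "annotations": {
                "service.alpha.kubernetes.io/tolerate-unready-endpoints": "true"
            },
        },
        "spec": {
            "clusterIP": "None",               # headless: DNS returns all member IPs
            "publishNotReadyAddresses": True,  # new method
            "selector": selector,
            "ports": [{"name": "mysql", "port": port}],
        },
    }


def client_service(name: str, selector: dict, port: int = 3306) -> dict:
    """Ordinary load-balancing service for database clients."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": selector, "ports": [{"name": "mysql", "port": port}]},
    }


if __name__ == "__main__":
    selector = {"app": "galera"}  # hypothetical pod label
    for svc in (headless_service("galera", selector),
                client_service("galera-clients", selector)):
        print(json.dumps(svc, indent=2))
```

The name of the headless service is the one that would go into 'serviceName' of the StatefulSet spec.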
- IMPORTANT: If we use hostPath (or even a hostPath-based pv/pvc pair), the pods will be assigned
to the nodes randomly. This is not ideal if we want to shut down and restart the cluster. In general,
we always want the first pod to end up on the same storage, as it will likely be the one able to
bootstrap. Instead, we should use the 'local' volume feature (alpha in OpenShift 3.7; it has to be
enabled in the origin-node and origin-master configurations). Then, OpenShift binds the 'pvc' to a specific node
and the 'pod' is executed on the node where its 'pvc' is bound.
- IMPORTANT: The StatefulSet ensures ordering and local volume data binding. Consequently, we should
not destroy the StatefulSet object, which saves the state information. Otherwise, the node assignments
will change and the cluster would be hard or impossible to recover.
- Another problem of our setup is the slow internal network (since bridging over Infiniband is not
possible). One solution to overcome this is to run Galera using 'hostNetwork'. Then, however,
the 'peer-finder' fails. It tries to match the service names to its 'hostname', expecting
that it will be in the form of 'galera-0.galera.adei.svc.cluster.local', but with host networking
enabled the actual hostname is used (i.e. ipekatrin1.ipe.kit.edu). I had to patch peer-finder
to resolve the IPs and try to match the IPs instead.
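
The idea behind matching by IP instead of hostname can be illustrated with a short sketch. It is not the actual peer-finder patch, only the matching logic under the assumption that the peer names come from the headless service and the set of local addresses is known; the function name and the injectable `resolve` parameter are made up for illustration.

```python
import socket
from typing import Callable, Iterable, Optional


def find_self(
    peers: Iterable[str],
    local_ips: set[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Optional[str]:
    """Return the peer name whose resolved IP is one of the local addresses.

    With host networking the pod's hostname is the node's hostname, so the
    peer names cannot be compared with it directly. Comparing the resolved
    IPs works in both cases.
    """
    for peer in peers:
        try:
            ip = resolve(peer)
        except OSError:  # includes socket.gaierror for names that do not resolve yet
            continue
        if ip in local_ips:
            return peer
    return None
```

A node that finds itself in the peer list knows its own service name and can then apply the usual "alone means bootstrap, otherwise join" decision.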
- To check the current status of the cluster:
SHOW STATUS LIKE 'wsrep_cluster_size';
Master/Slave replication
========================
- This configuration seems more robust, but strangely has a lot of performance issues on the
slave side. The network is not the problem: the slave is able to get the logs from the master, but it is significantly
slower in applying them. The main performance killers are disk sync operations triggered by 'sync_binlog',
INNODB log flushing, etc. Disabling them brings performance to a reasonable level. Still,
the master is caching at about 6-8 MB/s and the slave at only 4-5 MB/s.