docs/databases.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154

- The storage for HA datbases is problematic. There is several ways to organize storage. I list major 
  characteristics here (INNODB is generally faster, but takes about 20% more disk space. Initially it
  significantly faster and takes 5x disk space, but it normalizes...)

 Method         Database                Performance     Clnt/Cache      MySQL           Gluster         HA
 HostMount      MyISAM/INNODB           8 MB/s          fast            250%            -               Nope. But otherwise least problems to run.        
 Gluster        MyISAM (no logs)        1 MB/s          unusable        150%            600-800%        Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
 Gluster/Block  MyISAM (no logs)        5 MB/s          slow, but OK    200%            ~ 50%           No problems on reboot, but requires manual work if node crashes to detach volume.
 Galera         INNODB                  3.5 MB/s        fast            3 x 200%        -               Should be perfect, but I am not sure about automatic recovery...
 Galera/Hostnet INNODB                  4.6 MB/s        fast            3 x 200%        -               
 MySQL Slaves   INNODB                  5-8 MB/s        fast            2 x 250%        -               Available data is HA, but caching is not. We can easily turn the slave to master.
 DRBD           MyISAM (no logs)        4-6 exp.        ?                                               I expect it as an faster option, but does not fit the OpenShift concept that well.
 

 Gluster is a way too slow for anything. If node crashes, MyISAM tables may be left in corrupted state. The recovery will take ages to complete.
The Gluster/Block is faster, but HA suffers. The volume is attached to the pod running on crashed node. It seems not detached automatically until
the failed pod (in Unknown state) is killed with 
    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
Then, after some delay it is re-attached to the new running pod. Technically, we can run kind of monitoring service which will detect such nodes
and restart. Still, this solution is limited to MyISAM with binary logging disabled. Unlike simple Gluster solution, the clients may use the system
while caching is going, but is quite slow. The main trouble is MyISAM corruption, the recovery is slow. 

 Galera is slower when Gluster/Block, but is fully available. The clients have also more servers to query data from. The cluster start-up is a bit
tricky and I am not sure that everything will work smoothely now. Some tunning may be necessary. Furthermore, it seems if cluster is crashed, we 
can recover from one of the nodes, but all the data will be destroyed on other members and they would pull the complete dataset. The synchronization
is faster when caching (~ 140 MB/s), but it wil still take about 10 hours to synchronize 5 TB of KATRIN data.
 
So, there is no realy a full HA capable solution at the moment. The most reasonable seems compromising on caching HA.
 - MySQL with slaves.  The asynchronous replication should be significantly faster when Galera. The passthrough to source databases will be working 
 (i.e. status displays), current data is available. And we can easily switch the master if necessary.

The other reasonable options have some problems at the moment and can't be used.
 - Galera. Is a fine solution. The caching is still quite slow. If networking problem is solved (see performance section in network.txt) or host
 networking is used, it more-or-less on pair with Gluster/Block, but provides much better service to the data reading clients. However, extra 
 investigations are required to understand robustness of crash recovery. In some cases, after a crash Galera was performing a full resync of all
 data (but I was re-creating statefulset which is not recommended practice, not sure if it happens if the software maintained properly). Also, at
 some point  one of the nodes was not able to join back (even after re-initializing from scratch), but again this hopefully not happening if the
 service is not pereodically recreated.
  - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
 MyISAM recovery is quite slow.
 - HostMount will be using our 3-node storage optimally. But if something crashes there is 1 week to recache the data.

Gluster/Block
=============
 The idea is pretty simple. A standard gluster file system is used to store a 'block' files (just a normal files). This files are used as block devices
 with single-pod access policy. GFApi interface is used to access the data on Gluster (avoiding context switches) and is exposed over iSCSI to the clients.

 There are couple of problems with configuration and run-time.
 - The default Gluster containers while complain about rpcbind. We are using host networking in this case and the required ports (111) between container
 and the host system conflicts. We, however, are able just to use the host rpcbind. Consequently, the rpcbind should be removed from the Gluster container
 and the requirements removed from gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it 
 should). We additionally also need 'iscsi-initiator-utils' in the container.

 - Only a single pod should have access to the block device. Consequnetly, when the volume is attached to the client, other pods can't use it any more.
 The problem starts if node running pod dies. It is not perfectly handled by OpenShift now. The volume remains attached to the pod in the 'Unknown' state
 until it manually killed. Only, then, after another delay it is detached and available for replacement pod (which will struggle in ConteinerCreating 
 phase until then). The pods in 'Unknown' state is not easy to kill. 
    oc delete --force --grace-period=0 pod/mysql-1-m4wcq

 - Heketi is buggy. 
  * If something goes wrong, it starts create multitudes of Gluster volumes and finally crashes with broken database. It is possible to remove the 
  volumes and recover database from backup, but it is time consuming and unreliable for HA solution.
  * Particularly, this happens if we try to allocate more disk-space when available. The OpenShift configures the size of Gluster file system used
  to back block devices. It is 100 GB by default. If we specify 500Gi in pvc, it will try to create 15 such devices (another maximum configured by
  openshift) before crashing.
  * Overall, I'd rather only use the manual provisioning.

 - Also without heketi it is still problematic (may be it is better with official RH container running on GlusterFS 3.7), but I'd not check... We
 can try again with GlusterFS 4.1. There are probably multiple problems, but
  * GlusterFS may fail on one of the nodes (showing it up and running). If any of the block services have problems communicating with local gluster
  daemon, most requests (info/list will still work, but slow) to gluster daemon will timeout.
  
Galera
======
 - To bring new cluster up, there is several steps.
  * All members need to initialize standard standalone databases
  * One node should perform initialization and other nodes join after it is completed.
  * The nodes will delete their mysql folders and re-synchronize from the first node.
  * Then, cluster will be up and all nodes in so called primary state.

 - The procedure is similar for crash recovery:
  * If a node leaves the cluster, it may just come back and be re-sycnronized from other
  cluster members if there is a quorum. For this reason, it is necessary to keep at le
  ast 3 nodes running.
  * If all nodes crashed, then again one node should restart the cluster and others join
  later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'.
  The new tries to automatize it and will recover automatically if 'safe_to_bootstrap' = 1
  in 'grstate.dat' in mysql data folder. If cluster was shat down orderly, the Galera will
  set it automatically on the last node to stop the service. In case of a crash, however,
  it has to be configured manually on the most up to date node. IMIMPORTANT, it should be 
  set only on one of the nodes. Otherwise, the cluster will get nearly unrecoverable. 
  * So, to recover failed cluster (unless automatic recovery works) we must revert to manual
  procedure now. There is 'gmanager' pod which can be scalled to 3 nodes. We recover a full 
  cluster in this pods in required order. Then, we stop first node and init a statefulSet.
  As first node in the statefulSet is ready, we stop second node in 'gmanager' and so on.
  
 - IMPORTANT: Synchrinization only works for INNODB tables. Furthermore, binary logging should 
 be turned on (yes, it is possible to turn it off and there is no complains, but only the table
 names are synchronized, no data is pushed between the nodes).
 
 - OpenShift uses 'StatefulSet' to perform such initialization. Particularly, it starts first
 node and waits until it is running (and ready) before starting next one. 
  * Now the nodes need to talk between each other. The 'headless' service is used for that. 
  Unlinke standard service, the DNS does not load balance service pods, but returns IPs of
  all service members if appropriate DNS request is send (SRV). In Service spec we specify.
        clusterIP: None                                 - old version
  For clients we still need a load-balancing service. So, we need to add a second service 
  to serve their needs.
  * To decide if it should perform cluster initialization, the node tries to resolve members
  of the service. If it is alone, it initializes the cluster. Otherwise, tries to join the other
  members already registered in the service. The problem is that by default, OpenShift only
  will add member when it is ready (Readyness check). Consequently, all nodes will try to 
  initialize. There is two methods to prevent it. One is working up to 3.7 and other 3.8 up, 
  but it is no harm to use both for now). 
    The new is to set in Service spec:
        publishNotReadyAddresses: True
    The old is to specify in Service metadata.annotations:
        service.alpha.kubernetes.io/tolerate-unready-endpoints: true
  * Still, we should quickly check for peers until other pods had chance to start.
  * Furthermore, there is some differneces to 'dc' definition. We need to specify 'serviceName'
  in the StatefulSet spec.
      serviceName: adei-ss
  There are few other minor differences. For instance, the 'selector' have more flexible notation
  and should include 'matchLabels' before specifying the 'pod' selector, etc.

 - IMPORTANT: If we use hostPath (or even hostPath based pv/pvc pair), the pods will be assigned 
 to the nodes randomly. This is not ideal if we want to shutdown and restart cluster. In general, 
 we always want the first pod to end-up on the same storage as it will be likely the one able to
 boostrap. Instead, we should use 'local' volume feature (alpha in OpenShift 3.7 and should be
 enabled in origin-node and origin-master configurations). Then, openshift 'pvc' to specific node
 and the 'pod' executed on the node where its 'pvc' is bounded.
 
 - IMPORTANT: StatefulSet ensures ordering and local volume data binding. Consequently, we should 
 not destroy StatefulSet object which save the state information. Otherwise, the node assignments 
 will chnage and cluster would be hard to impossible to recover.

  - Another problem of our setup is slow internal network (since bridging over Infiniband is not 
  possible). One solution to overcome this is to run Galera using 'hostNetwork'. Then, however,
  the 'peer-finder' is failing. It tries to match the service names to its 'hostname' expecting
  that it will be in the form of 'galera-0.galera.adei.svc.cluster.local', but with host networking
  enabled the actual hostname is used (i.e. ipekatrin1.ipe.kit.edu). I have to patch peer-finder
  to resolve IPs and try to match the IPs.
  
 - To check current status of the cluster
        SHOW STATUS LIKE 'wsrep_cluster_size';
 
Master/Slave replication
========================
 - This configuration seems more robuts, but strangely has a lot of performance issues on the 
 slave side. Network is not a problem, it is able to get logs from the master, but it is significantly
 slower in applying it. The main performance killer is disk sync operations triggered by 'sync_binlog',
 INNODB log flashing, etc. Disabling it allows to bring performance on reasonable level. Still,
 the master is caching at about 6-8 MB/s and slave at 4-5 MB/s only.