DOs and DON'Ts
==============
 Here we discuss things we should do and things we should not do.
 
 - Scaling up the cluster is normally painless. Both nodes and masters can be added
 quickly and without much trouble afterwards.

 - The upgrade procedure may cause problems. The main trouble is that many pods are
 configured to use the 'latest' tag, and the latest versions come with the latest problems (some
 of the tags can be pinned to an actual version, but finding out what is broken and why takes
 a lot of effort)...
    * Currently, there are problems if 'kube-service-catalog' is updated (see the discussion
    in docs/upgrade.txt). While it seems nothing really changes, the connection between
    apiserver and etcd breaks down (at least for health checks). The installation remains
    pretty much usable, but not in a healthy state. This particular update is blocked by
    setting
        openshift_enable_service_catalog: false
    Then, it is left in 'Error' state, but can be easily recovered by deleting the pod and
    allowing the system to re-create a new one.
    * However, as the cause is unclear, it is possible that something else will break as time
    passes and new images are released. It is ADVISED to check the upgrade in staging first.
    * During the upgrade other system pods may also get stuck in 'Error' state (as explained
    in troubleshooting) and block the flow of the upgrade. Just delete them and allow the
    system to re-create them to continue.
    * After the upgrade, it is necessary to verify that all pods are operational and
    restart the ones in 'Error' state, e.g. as sketched below.
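    A minimal sketch of finding and clearing such pods (standard oc commands; the deleted pods are
    re-created by their controllers):
        oc get pods --all-namespaces | grep -E 'Error|CrashLoopBackOff'
        oc -n <namespace> delete pod <pod_name>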

 - Re-running the install will break on heketi. And it will DESTROY the heketi topology!
 DON'T DO IT! Instead, separate components can be re-installed.
    * For instance, to reinstall 'openshift-ansible-service-broker' use
         openshift-install-service-catalog.yml
    * There is a way to prevent the plays from touching heketi: we need to define
        openshift_storage_glusterfs_is_missing: False
        openshift_storage_glusterfs_heketi_is_missing: False
    But I am not sure if that is the only major issue.
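    A hypothetical invocation of such a component play (the playbook location and the inventory
    name depend on the local checkout; adjust before use):
        ansible-playbook -i staging openshift-install-service-catalog.yml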

 - A few administrative tools can cause trouble. Don't run
    * oc adm diagnostics


Failures / Immediate
====================
 - We need to remove the failed node from the etcd cluster
    etcdctl3 --endpoints="192.168.213.1:2379" member list
    etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>

 - Further, the following is required on all remaining nodes if the node is gone for good
    * Delete the node
        oc delete node <node_name>
    * Remove it also from the ETCD_INITIAL_CLUSTER variable in /etc/etcd.conf on all nodes
    * Remove the failed node from the 'etcdClientInfo' section in /etc/origin/master/master-config.yaml and restart the API
        systemctl restart origin-master-api.service
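    * A quick sanity check afterwards (standard commands):
        oc get nodes
        etcdctl3 --endpoints="192.168.213.1:2379" member list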
    
Scaling / Recovery
==================
 - One important point.
  * If we lost the data on a storage node, it should be re-added with a different name (otherwise
  the GlusterFS recovery would be significantly more complicated).
  * If the Gluster bricks are preserved, we may keep the name. I have not tried it, but according to
  the documentation it should be possible to reconnect the node and synchronize. Still, it may be
  easier to use a new name to simplify the procedure.
  * Simple OpenShift nodes may be re-added with the same name, no problem.

 - Next we need to perform all the preparation steps (--limit should not be applied, as we normally
 need to update CentOS on all nodes to synchronize software versions, list all nodes in the /etc/hosts
 files, etc.).
    ./setup.sh -i staging prepare

 - The OpenShift scale-up is provided as several ansible plays (scale-masters, scale-nodes, scale-etcd).
  * Running 'masters' will also install the configured 'nodes' and 'etcd' daemons.
  * I guess running 'nodes' will also handle the 'etcd' daemons, but I have not checked. A hypothetical
  invocation is sketched below.
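    Following the same setup.sh convention as the 'prepare' step above, the invocation might look
    as follows (the exact action names should be checked against the script; this is an assumption):
        ./setup.sh -i staging scale-nodes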

Problems
--------
 - There should be no problems if a simple node crashes, but things may go wrong if one of the
 masters crashes. And things definitely will go wrong if the complete cluster is cut from power.
  * Some pods will get stuck pulling images. This happens if the node running docker-registry has crashed
  and persistent storage was not used to back the registry. It can be fixed by re-scheduling the build
  and rolling out the latest version from the dc.
        oc -n adei start-build adei
        oc -n adei rollout latest mysql
    OpenShift will trigger the rollout automatically after some time, but it will take a while. The builds,
    it seems, have to be started manually.
  * In case of a long outage some CronJobs will stop executing. The reason is a protection against
  excessive load combined with missing defaults. The fix is easy: just configure how much time the OpenShift
  scheduler allows a CronJob to start late before considering it failed:
    oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 10 }}'

 - If we forgot to remove the old host from the etcd cluster, the OpenShift node will be configured, but etcd
 will not be installed. We then need to remove the node as explained above and run the scale of the etcd
 cluster.
    * On multiple occasions, the etcd daemon has failed after a reboot and needed to be restarted manually
    (see the sketch below). If half of the daemons are broken, 'oc' will block.
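    A minimal check-and-restart sketch, assuming etcd runs as a systemd service on the master (on
    containerized installs the unit may be named differently):
        systemctl status etcd
        systemctl restart etcd
        etcdctl3 --endpoints="192.168.213.1:2379" endpoint health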

    

Storage / Recovery
==================
 - We have some manually provisioned resources which need to be fixed.
    * GlusterFS endpoints should point to the new nodes (see the sketch after this list).
    * If we use Gluster/Block storage, all 'pv' refer to iscsi 'portals'. They also have to be updated to
    the new server names. I am not sure how this is handled for auto-provisioned resources.
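    A hypothetical way to locate and edit the manually provisioned endpoints (the object and
    namespace names here are examples only):
        oc get endpoints --all-namespaces | grep -i gluster
        oc -n adei edit endpoints glusterfs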
 - Furthermore, it is necessary to add the glusterfs daemons on the new storage nodes. This is not performed
 automatically by the scale plays. The 'glusterfs' play should be executed with additional options
 specifying that we are just re-configuring nodes. We can check if all pods are serviced with
    oc -n glusterfs get pods -o wide
 Both the OpenShift and etcd clusters should be in a proper state before running this play. Fixing and
 re-running it should not be an issue.

 - More details:
    https://docs.openshift.com/container-platform/3.7/day_two_guide/host_level_tasks.html




Heketi
------
 - With heketi things are straightforward: we need to mark the node broken. Then heketi will automatically move the
 bricks to other servers (as it sees fit).
    * Accessing heketi
        heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)"
    * Getting the required ids
        heketi-cli topology info
    * Removing the node
        heketi-cli node info <failed_node_id>
        heketi-cli node disable <failed_node_id>
        heketi-cli node remove <failed_node_id>
    * That's it. A few self-healing daemons are running which should bring the volumes in order automatically.
    * The node will still remain in the heketi topology as failed, but it will not be used ('node delete' could potentially remove it, but it is failing).

 - One problem with heketi: it may start volumes before the bricks are ready. Consequently, it may run volumes with several bricks offline. This should be
 checked and fixed by restarting the affected volumes, e.g. as sketched below.
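    A minimal check-and-restart sketch (standard gluster commands; the volume name is a placeholder):
        gluster volume status <volume>
        gluster volume start <volume> force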
 
KaaS Volumes
------------
 There are two modes.
 - If we migrated to a new server, we need to migrate the bricks (force is required because
 the source brick is dead and the data can't be copied)
        gluster volume replace-brick <volume> <src_brick> <dst_brick>  commit force
    * There are healing daemons running and nothing else has to be done.
    * A play and scripts are available to move all bricks automatically; a minimal sketch of the
    idea is shown below.
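    The sketch assumes the failed and replacement hosts expose the bricks under the same paths
    ('ipeshift2' is taken from the examples below, 'ipeshift4' is a hypothetical replacement;
    adjust both before running):
        FAILED=ipeshift2; NEW=ipeshift4
        for vol in $(gluster volume list); do
            # pick the bricks of this volume that live on the failed host
            gluster volume info "$vol" | awk -v h="$FAILED" '$1 ~ /^Brick[0-9]+:$/ && index($2, h ":") == 1 {print $2}' | \
            while read -r brick; do
                path=${brick#*:}
                gluster volume replace-brick "$vol" "$brick" "$NEW:$path" commit force
            done
        done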

 - If we kept the name and the data is still there, it should also be relatively easy
 to perform the migration (not checked). We should also have backups of all this data.
    * Ensure Gluster is not running on the failed node
        oadm manage-node ipeshift2 --schedulable=false
        oadm manage-node ipeshift2 --evacuate
    * Verify the gluster pod is not active. It may be running, but not ready.
    This can be double-checked with 'ps'.
        oadm manage-node ipeshift2 --list-pods
    * Get the original peer UUID of the failed node (by running on a healthy node)
        gluster peer status
    * And create '/var/lib/glusterd/glusterd.info' similar to the one on the
    healthy nodes, but with the found UUID (see the sketch after this list).
    * Copy the peer files from the healthy nodes to /var/lib/glusterd/peers. We need to
    copy from 2 nodes because a node does not hold peer information about itself.
    * Create the mount points and re-schedule the gluster pod. More details:
        https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts
    * Start healing
        gluster volume heal VOLNAME full
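    A sketch of the expected /var/lib/glusterd/glusterd.info content (the UUID is the one recovered
    via 'gluster peer status'; the operating-version below is only an example and must match the
    healthy nodes):
        UUID=<uuid_of_failed_node>
        operating-version=31101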

 - However, if the data is lost, it is quite complicated to recover using the same server name.
 We should rename the server and use the first approach instead.
 
 
 
Scaling
=======
 - If we use container-native routing, we need to add routes for the new nodes on the Infiniband
 side, see docs:
    https://docs.openshift.com/container-platform/3.7/install_config/configuring_native_container_routing.html#install-config-configuring-native-container-routing
 Basically, the Infiniband switch should send packets destined to the network 11.11.<hostid>.0/24 to the corresponding node, i.e. 192.168.13.<hostid>.
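 A hypothetical route entry for the node with hostid 2, as it would look on a Linux-based router
 (the actual configuration depends on the switch):
    ip route add 11.11.2.0/24 via 192.168.13.2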
    
We also currently have several assumptions which will probably not hold true for larger clusters
 - Gluster
    To simplify matters we just reference the servers in the storage group manually
    Arbiter may work for several groups and we should define several brick paths in this case