DOs and DONTs
=============
Here we discuss things we should do and things we should not do.
- Scaling up the cluster is normally painless. Both nodes and masters can be added
  quickly and cause little trouble afterwards.
- The upgrade procedure may cause problems. The main trouble is that many pods are
  configured to use the 'latest' tag, and the latest versions bring the latest problems
  (some of the tags can be pinned to an actual version, but finding out what is broken
  and why takes a lot of effort)...
  * Currently, there are problems if 'kube-service-catalog' is updated (see the discussion
    in docs/upgrade.txt). While it seems nothing really changes, the connection between
    apiserver and etcd breaks down (at least for health checks). The installation remains
    pretty much usable, but not in a healthy state. This particular update is blocked by
    setting:
        openshift_enable_service_catalog: false
    The pod is then left in 'Error' state, but it can easily be recovered by deleting it
    and allowing the system to re-create it.
  * However, as the cause is unclear, it is possible that something else will break as time
    passes and new images are released. It is ADVISED to test the upgrade in staging first.
  * During the upgrade other system pods may also get stuck in 'Error' state (as explained
    in troubleshooting) and block the flow of the upgrade. Just delete them and allow the
    system to re-create them to continue.
  * After the upgrade, it is necessary to verify that all pods are operational and to
    restart the ones in 'Error' state (see the example after this list).
- Re-running the install will break on heketi and it will DESTROY the heketi topology!
  DON'T DO IT! Instead, separate components can be re-installed individually.
  * For instance, to reinstall 'openshift-ansible-service-broker' use
        openshift-install-service-catalog.yml
  * There is a way to prevent the plays from touching heketi; we need to define
        openshift_storage_glusterfs_is_missing: False
        openshift_storage_glusterfs_heketi_is_missing: False
    But I am not sure whether heketi is the only major issue.
- A few administrative tools can cause trouble. Don't run
  * oc adm diagnostics
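A minimal sketch for the post-upgrade cleanup mentioned above (namespace and pod name are placeholders):
    # list pods that did not come back healthy after the upgrade
    oc get pods --all-namespaces | grep -E 'Error|CrashLoopBackOff'
    # delete a broken pod and let its controller re-create it
    oc -n <namespace> delete pod <pod-name>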
Failures / Immediate
====================
- We need to remove the failed node from the etcd cluster
      etcdctl3 --endpoints="192.168.213.1:2379" member list
      etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
- Further, if the node is gone for good, the following is required on all remaining nodes
  * Delete the node
        oc delete node <node-name>
  * Remove it also from ETCD_INITIAL_CLUSTER in /etc/etcd.conf on all nodes
  * Remove the failed node from the 'etcdClientInfo' section in /etc/origin/master/master-config.yaml
    and restart the master API (see the example below):
        systemctl restart origin-master-api.service
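For reference, the 'etcdClientInfo' section of master-config.yaml looks roughly like this (the
certificate file names and URLs below are examples, not taken from this cluster; only the 'urls'
list needs to lose the failed member):
    etcdClientInfo:
      ca: master.etcd-ca.crt
      certFile: master.etcd-client.crt
      keyFile: master.etcd-client.key
      urls:
        - https://192.168.213.1:2379
        - https://192.168.213.3:2379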
Scaling / Recovery
==================
- One important point:
  * If we lost the data on a storage node, it should be re-added with a different name (otherwise
    the GlusterFS recovery would be significantly more complicated).
  * If the Gluster bricks are preserved, we may keep the name. I have not tried it, but according to
    the documentation it should be possible to reconnect the node and synchronize. Still, it may be
    easier to use a new name again to simplify the procedure.
  * Simple OpenShift nodes may be re-added with the same name, no problem.
- Next we need to perform all preparation steps (the --limit should not be applied as we normally
  need to update CentOS on all nodes to synchronize software versions, list all nodes in the /etc/hosts
  files, etc.)
      ./setup.sh -i staging prepare
- The OpenShift scaling is provided as several ansible plays (scale-masters, scale-nodes, scale-etcd).
  * Running 'masters' will also install the configured 'nodes' and 'etcd' daemons.
  * I guess running 'nodes' will also handle the 'etcd' daemons, but I have not checked.
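After the scale plays finish, it is worth verifying that the new node has joined both clusters,
e.g. (the etcd endpoint follows the example used above):
    oc get nodes
    etcdctl3 --endpoints="192.168.213.1:2379" member list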
Problems
--------
- There should be no problems if a simple node crashes, but things may go wrong if one of the
  masters crashes. And things definitely will go wrong if the complete cluster is cut from power.
  * Some pods will be stuck pulling images. This happens if the node running docker-registry has
    crashed and no persistent storage was used to back the registry. It can be fixed by re-scheduling
    the build and rolling out the latest version from the dc:
        oc -n adei start-build adei
        oc -n adei rollout latest mysql
    OpenShift will trigger the rollout automatically after some time, but it will take a while. The
    builds, it seems, have to be started manually.
  * In case of a long outage some CronJobs will stop executing. The reason is a protection against
    excessive load combined with missing defaults. The fix is easy: just set how much time the
    OpenShift scheduler allows a CronJob to start late before considering it failed:
        oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 10 }}'
- If we forgot to remove the old host from the etcd cluster, the OpenShift node will be configured,
  but etcd will not be installed. We then need to remove the node as explained above and run the
  etcd scale play again.
  * On multiple occasions the etcd daemon has failed after a reboot and needed to be restarted
    manually (see the check below). If half of the daemons are broken, 'oc' will block.
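A minimal sketch for checking and restarting etcd after a reboot (the endpoint follows the example
above; the service name assumes an RPM-based etcd install and may differ on containerized setups):
    etcdctl3 --endpoints="192.168.213.1:2379" endpoint health
    systemctl status etcd
    systemctl restart etcd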
Storage / Recovery
==================
- We have some manually provisioned resources which need to be fixed.
  * The GlusterFS endpoints should be pointed to the new nodes.
  * If we use Gluster/Block storage, all 'pv' refer to iscsi 'portals'. They also have to be updated
    to the new server names. I am not sure how this is handled for auto-provisioned resources.
- Furthermore, it is necessary to add glusterfs daemons on the new storage nodes. This is not
  performed automatically by the scale plays. The 'glusterfs' play should be executed with additional
  options specifying that we are just re-configuring nodes. We can check that all storage nodes are
  serviced with
      oc -n glusterfs get pods -o wide
  Both the OpenShift and etcd clusters should be in a proper state before running this play. Fixing
  and re-running should not be an issue.
- More details:
https://docs.openshift.com/container-platform/3.7/day_two_guide/host_level_tasks.html
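A hedged sketch for locating manually provisioned resources that still reference the old server
names (object and namespace names are placeholders):
    # endpoints backing the GlusterFS volumes
    oc get endpoints --all-namespaces | grep gluster
    oc -n <namespace> edit endpoints/<gluster-endpoints-name>
    # persistent volumes whose iscsi portals may still point to the old servers
    oc get pv -o yaml | grep -B5 portals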
Heketi
------
- With heketi things are straightforward: we need to mark the node as broken, and heketi will
  automatically move the bricks to other servers (as it sees fit).
* Accessing heketi
heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)"
  * Getting the required ids
        heketi-cli topology info
* Removing node
heketi-cli node info <failed_node_id>
heketi-cli node disable <failed_node_id>
heketi-cli node remove <failed_node_id>
  * That's it. A few self-healing daemons are running which should bring the volumes in order
    automatically.
  * The node will still persist in the heketi topology as failed, but it will not be used
    ('node delete' could potentially remove it, but it is failing).
- One problem with heketi is that it may start volumes before the bricks get ready. Consequently, it
  may run volumes with several bricks offline. This should be checked and fixed by restarting the
  affected volumes (see the sketch below).
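A minimal sketch of that check, run from inside one of the glusterfs pods (pod and volume names are
placeholders):
    oc -n glusterfs rsh <glusterfs-pod>
    gluster volume status                 # bricks showing 'N' in the Online column are down
    gluster volume stop <volume>
    gluster volume start <volume> force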
KaaS Volumes
------------
There are two modes.
- If we migrated to a new server, we need to migrate the bricks (force is required because
  the source brick is dead and the data can't be copied)
      gluster volume replace-brick <volume> <src_brick> <dst_brick> commit force
  * There are healing daemons running and nothing else has to be done.
  * There are a play and scripts available to move all bricks automatically.
- If we kept the name and the data is still there, it should also be relatively easy to perform
  the migration (not checked). We should also have backups of all this data.
  * Ensure Gluster is not running on the failed node
        oadm manage-node ipeshift2 --schedulable=false
        oadm manage-node ipeshift2 --evacuate
  * Verify the gluster pod is not active. It may be running, but not ready.
    This can be double-checked with 'ps'.
        oadm manage-node ipeshift2 --list-pods
  * Get the original peer UUID of the failed node (by running on a healthy node)
        gluster peer status
  * Then create '/var/lib/glusterd/glusterd.info' similar to the one on the healthy nodes, but
    with the UUID found above (see the sketch after this list).
  * Copy the peers from the healthy nodes to /var/lib/glusterd/peers. We need to copy from 2 nodes
    as a node does not hold peer information about itself.
  * Create the mount points and re-schedule the gluster pod. See more details at
        https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts
  * Start healing
        gluster volume heal VOLNAME full
- However, if the data is lost, it is quite complicated to recover using the same server name.
  We should rename the server and use the first approach instead.
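A hedged sketch of the glusterd.info step above (the UUID is the one reported by 'gluster peer
status' for the failed node; copy the operating-version value from a healthy node instead of
trusting the number shown here):
    # on the re-installed node
    cat > /var/lib/glusterd/glusterd.info <<EOF
    UUID=<uuid-of-failed-node>
    operating-version=31101
    EOF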
Scaling
=======
- If we use container-native routing, we need to add routes to the new nodes on the Infiniband side,
  see the docs:
      https://docs.openshift.com/container-platform/3.7/install_config/configuring_native_container_routing.html#install-config-configuring-native-container-routing
  Basically, the Infiniband switch should send packets destined for the network 11.11.<hostid>.0/24
  to the corresponding node, i.e. 192.168.13.<hostid> (see the example below).
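For illustration only, assuming a new node with <hostid> = 5 (the address scheme follows the one
described above; whether the route lives on the switch or on the other hosts depends on the setup):
    ip route add 11.11.5.0/24 via 192.168.13.5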
We also currently have several assumptions which will probably not hold true for larger clusters:
- Gluster
  * To simplify matters we just reference the servers in the storage group manually.
  * The arbiter may work for several groups, and we should define several brick paths in that case.