author     Suren A. Chilingaryan <csa@suren.me>    2018-03-11 19:56:38 +0100
committer  Suren A. Chilingaryan <csa@suren.me>    2018-03-11 19:56:38 +0100
commit     f3c41dd13a0a86382b80d564e9de0d6b06fb1dbf (patch)
tree       3522ce77203da92bb2b6f7cfa2b0999bf6cc132c /docs/troubleshooting.txt
parent     6bc3a3ac71e11fb6459df715536fec373c123a97 (diff)
Various fixes before moving to hardware installation
Diffstat (limited to 'docs/troubleshooting.txt')
 -rw-r--r--   docs/troubleshooting.txt   210
 1 file changed, 210 insertions, 0 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
new file mode 100644
index 0000000..b4ac8e7
--- /dev/null
+++ b/docs/troubleshooting.txt
@@ -0,0 +1,210 @@

Services that have to be running
--------------------------------
 Etcd:
  - etcd

 Node:
  - origin-node

 Master nodes:
  - origin-master-api
  - origin-master-controllers
  - origin-master itself is not running

 Required Services:
  - lvm2-lvmetad.socket
  - lvm2-lvmetad.service
  - docker
  - NetworkManager
  - firewalld
  - dnsmasq
  - openvswitch

 Extra Services:
  - ssh
  - ntp
  - openvpn
  - ganesha (on master nodes, optional)

Pods that have to be running
----------------------------
 Kubernetes System
  - kube-service-catalog/apiserver
  - kube-service-catalog/controller-manager

 OpenShift Main Services
  - default/docker-registry
  - default/registry-console
  - default/router (3 replicas)
  - openshift-template-service-broker/api-server (daemonset, on all nodes)

 OpenShift Secondary Services
  - openshift-ansible-service-broker/asb
  - openshift-ansible-service-broker/asb-etcd

 GlusterFS
  - glusterfs-storage (daemonset, on all storage nodes)
  - glusterblock-storage-provisioner-dc
  - heketi-storage

 Metrics (openshift-infra):
  - hawkular-cassandra
  - hawkular-metrics
  - heapster


Debugging
=========
 - Ensure system consistency as explained in 'consistency.txt' (incomplete)
 - Check the current pod logs and, possibly, the logs of the last failed instance
    oc logs <pod name> --tail=100 [-p]        # dc/<name> or ds/<name> can be used as well
 - Verify the initialization steps (check if all volumes are mounted)
    oc describe <pod name>
 - It is worth looking at the pod environment
    oc env po <pod name> --list
 - It is worth connecting to the running container with an 'rsh' session and looking at the running
   processes, internal logs, etc. The 'debug' session will start a new instance of the pod instead.
 - Check whether the corresponding pv/pvc are bound, and check the logs for the pv.
    * Even if the 'pvc' is bound, the 'pv' may have problems with its backend.
    * Check the logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/
 - Another frequent problem is a failing 'postStart' hook or 'livenessProbe'. As the pod crashes
   immediately, it is not possible to connect. Remedies are:
    * Set a larger initial delay before the probe is checked.
    * Try to remove the hook and execute it manually in an 'rsh'/'debug' session.
 - Determine the node running the pod and check the host logs in '/var/log/messages'
    * Particularly, the logs of 'origin-master-controllers' are of interest.
 - Check which docker images are actually downloaded on the node
    docker images

network
=======
 - There is a NetworkManager dispatcher script which should adjust /etc/resolv.conf to use the local
   dnsmasq server. It is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh', which does not
   play well if OpenShift is running on a non-default network interface. I provided a patched version,
   but it is worth verifying
    * that the nameserver points to the host itself (but not to localhost; this is important
      to allow the running pods to use it)
    * that the correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
    * In some cases, it was necessary to restart dnsmasq (but this can also happen for different reasons)
   If the script misbehaves, it is possible to call it manually like this:
    DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
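 - A minimal verification sketch for the points above (the interface name 'eth1' is only the value
   from the example and may differ on your nodes; adjust before use):
    DEVICE_IFACE="eth1"
    # pick the first IPv4 address of the interface; the nameserver entry should point to it
    HOST_IP=$(ip -4 addr show "$DEVICE_IFACE" | awk '/inet /{ sub(/\/.*/, "", $2); print $2; exit }')
    echo "host address: $HOST_IP"
    grep ^nameserver /etc/resolv.conf              # should list $HOST_IP, not 127.0.0.1
    cat /etc/dnsmasq.d/origin-upstream-dns.conf    # should list the real upstream DNS servers
    dig +short @"$HOST_IP" kubernetes.default.svc.cluster.local   # requires bind-utils; local dnsmasq should resolve cluster names
    systemctl restart dnsmasq                      # only if the files look right but resolution still fails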

etcd (and general operability)
==============================
 - A few of these services may seem to be running according to 'systemctl', but actually misbehave.
   Then, it may be necessary to restart them manually. I have noticed it with
    * lvm2-lvmetad.socket ('pvscan' will complain about problems)
    * origin-node
    * etcd, but BEWARE of too enthusiastic restarting:
      - Restarting etcd many times is BAD, as it may trigger a severe problem with
        'kube-service-catalog/apiserver'. The bug description is here:
         https://github.com/kubernetes/kubernetes/issues/47131
      - Due to the problem mentioned above, all 'oc' queries become very slow. There is no proper
        solution suggested, but killing the 'kube-service-catalog/apiserver' pod helps for a while:
        the pod is restarted and response times are back in order.
    * Another way to see this problem is querying the 'healthz' service, which will tell that
      there are too many clients and, please, retry later.
       curl -k https://apiserver.kube-service-catalog.svc/healthz

 - On a node crash, the etcd database may get corrupted.
    * There is no easy fix. Backup/restore is not working.
    * The easiest option is to remove the failed etcd from the cluster:
       etcdctl3 --endpoints="192.168.213.1:2379" member list
       etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
    * Add it to the [new_etcd] section in the inventory and run the openshift-etcd playbook to
      scale the etcd cluster back up.

 - There is a health check provided by the cluster
    curl -k https://apiserver.kube-service-catalog.svc/healthz
   It may complain about etcd problems. This seems to be triggered by an OpenShift upgrade. The real
   cause and remedy are unclear, but the installation is mostly working. Discussion is in docs/upgrade.txt.

 - There is also a different etcd which is an integral part of the ansible service broker:
   'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs', it complains about:
    2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
    WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
   Nevertheless, it seems to work without much trouble. The error message seems to be caused by the
   certificate verification code introduced in etcd 3.2. There are multiple bug reports on the issue.
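 - A quick health sweep combining the checks above (a sketch: the endpoint address is the one from
   the example, and the 'etcdctl3' alias is assumed to be configured on the master nodes as used above):
    etcdctl3 --endpoints="192.168.213.1:2379" endpoint health    # per-member etcd health
    etcdctl3 --endpoints="192.168.213.1:2379" member list        # should show the expected number of members
    curl -k https://apiserver.kube-service-catalog.svc/healthz   # service-catalog apiserver health
    time oc get nodes                                            # very slow responses hint at the apiserver problem above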

pods (failed pods, rogue namespaces, etc...)
============================================
 - After crashes / upgrades some pods may end up in the 'Error' state. This quite often happens to
    * kube-service-catalog/controller-manager
    * openshift-template-service-broker/api-server
   Normally, they should just be deleted. OpenShift will then auto-restart the pods and they will
   likely run without problems.
    for name in $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }'); do oc -n openshift-template-service-broker delete po $name; done
    for name in $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }'); do oc -n kube-service-catalog delete po $name; done

 - Other pods will fail with 'ImagePullBackOff' after a cluster crash. The problem is that the
   ImageStreams populated by 'builds' are not recreated automatically. By default, the OpenShift
   docker registry is stored on ephemeral disks and is lost on crash. The builds should be
   re-executed manually.
    oc -n adei start-build adei

 - Furthermore, after long outages the CronJobs will stop functioning. The reason can be found by
   analyzing '/var/log/messages' or, specifically,
    systemctl status origin-master-controllers
   It will contain something like:
    'Cannot determine if <namespace>/<cronjob> needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.'
    * The reason is that after 100 missed (or failed) launch periods the controller stops trying, to
      avoid excessive load. The remedy is to set 'startingDeadlineSeconds', which tells the system
      that if the CronJob has failed to start within the allocated interval, we stop trying until the
      next start period. Missed starts are then only counted within the specified deadline, so the
      deadline should be kept shorter than 100 launch periods.
       https://github.com/kubernetes/kubernetes/issues/45825
    * The running CronJobs can be easily patched with
       oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'

 - Sometimes there are rogue namespaces stuck in the 'deleting' state. There are hundreds of possible
   reasons, but mainly
    * a crash of both masters during population / destruction of OpenShift resources
    * running 'oc adm diagnostics'
   It is unclear how to remove them manually, but it seems that after an OpenShift upgrade the
   namespaces are gone (though there could be a bunch of new problems). Whether install, etc. may
   cause the same trouble is unknown.

 - There are also rogue pods (mainly due to problems with unmounting lost storage), e.g. when
   'oc delete' does not complete for a long time. It is worth
    * determining the host running the failed pod with 'oc get pods -o wide'
    * going to that host, killing the processes and stopping the container using docker commands
    * looking in '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
      - This can be done with 'find . -name heketi*' or similar.
      - There could be problematic mounts which can be freed with a lazy umount.
      - The folders of removed pods may (and should) be removed.

 - Looking into '/var/log/messages', it is sometimes possible to spot various errors like
    * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
      The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
    * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
      - We can find and remove the corresponding container (the short id is just the first letters of the long id)
         docker ps -a | grep aa28e9c76
         docker rm <id>
      - We can also just destroy all containers which are not running (it will actually try to remove
        all of them, but only an error message will be printed for the running ones)
         docker ps -aq --no-trunc | xargs docker rm
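 - A cleanup sketch for such an orphaned pod on the affected node (the UID below is only the example
   value from the log message above; substitute the one reported on your node):
    POD_UID="212074ca-1d15-11e8-9de3-525400225b53"
    POD_DIR="/var/lib/origin/openshift.local.volumes/pods/$POD_UID"
    ls "$POD_DIR"                       # inspect what is left behind
    mount | grep "$POD_UID"             # any volume mounts still held for this pod
    umount -l <stuck mount point>       # lazy umount for mounts that refuse to go away
    rm -rf "$POD_DIR"                   # remove the leftovers once nothing is mounted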

Storage
=======
 - Running a lot of pods may exhaust the available storage. It is worth checking whether
    * there is enough Docker storage for containers (lvm)
    * there is enough Heketi storage for dynamic volumes (lvm)
    * the root file system on the nodes still has space for logs, etc.
   This is particularly a problem for ansible-run virtual machines. The system disk is stored
   under '/root/VirtualBox VMs' and, unlike the second hard drive, is not cleaned/destroyed on
   'vagrant destroy'. So, it should be cleaned manually.

 - Problems with pvc's can be evaluated by running
    oc -n openshift-ansible-service-broker describe pvc etcd
   Furthermore, it is worth looking in the folder with volume logs. For each 'pv' it contains
   subdirectories for the pods executed on this host which mount this 'pv', holding the logs for
   these pods.
    /var/lib/origin/plugins/kubernetes.io/glusterfs/

 - Heketi is problematic.
    * It is worth checking if the topology is fine and heketi is running.
       heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
    * Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This
      can be checked with
       gluster volume status <vol> detail
    * If not all bricks are online, it is often enough to just restart the volume
       gluster volume stop <vol>
       gluster volume start <vol>
    * This may break services depending on the provisioned 'pv', like 'openshift-ansible-service-broker/asb-etcd'.
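 - A detection-only sketch that lists volumes with offline bricks (run on a storage node or inside a
   'glusterfs-storage' pod; the parsing of the 'Online' field is an assumption and may need adjusting
   for your gluster version). Restarting volumes is deliberately left to a manual decision because of
   the dependent services mentioned above.
    for vol in $(gluster volume list); do
        # count bricks reported as not online in the detailed status
        offline=$(gluster volume status "$vol" detail | grep "^Online" | grep -c ": N")
        [ "$offline" -gt 0 ] && echo "$vol: $offline brick(s) offline"
    done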