author     Suren A. Chilingaryan <csa@suren.me>    2018-03-11 19:56:38 +0100
committer  Suren A. Chilingaryan <csa@suren.me>    2018-03-11 19:56:38 +0100
commit     f3c41dd13a0a86382b80d564e9de0d6b06fb1dbf (patch)
tree       3522ce77203da92bb2b6f7cfa2b0999bf6cc132c /docs/troubleshooting.txt
parent     6bc3a3ac71e11fb6459df715536fec373c123a97 (diff)
Various fixes before moving to hardware installation
Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r--  docs/troubleshooting.txt  210
1 file changed, 210 insertions, 0 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
new file mode 100644
index 0000000..b4ac8e7
--- /dev/null
+++ b/docs/troubleshooting.txt
@@ -0,0 +1,210 @@
+The services have to be running
+-------------------------------
+ Etcd:
+ - etcd
+
+ Node:
+ - origin-node
+
+ Master nodes:
+ - origin-master-api
+ - origin-master-controllers
+ - origin-master itself is not running (it is replaced by the separate api and controllers services above)
+
+ Required Services:
+ - lvm2-lvmetad.socket
+ - lvm2-lvmetad.service
+ - docker
+ - NetworkManager
+ - firewalld
+ - dnsmasq
+ - openvswitch
+
+ Extra Services:
+ - ssh
+ - ntp
+ - openvpn
+ - ganesha (on master nodes, optional)
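+
+ A quick way to verify these in one go; the unit list below is just an illustration and
+ should be adjusted to the role of the node:
+ systemctl is-active etcd origin-node origin-master-api origin-master-controllers \
+ docker NetworkManager firewalld dnsmasq openvswitch lvm2-lvmetad.socket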
+
+Pods have to be running
+-----------------------
+ Kubernetes System
+ - kube-service-catalog/apiserver
+ - kube-service-catalog/controller-manager
+
+ OpenShift Main Services
+ - default/docker-registry
+ - default/registry-console
+ - default/router (3 replicas)
+ - openshift-template-service-broker/api-server (daemonset, on all nodes)
+
+ OpenShift Secondary Services
+ - openshift-ansible-service-broker/asb
+ - openshift-ansible-service-broker/asb-etcd
+
+ GlusterFS
+ - glusterfs-storage (daemonset, on all storage nodes)
+ - glusterblock-storage-provisioner-dc
+ - heketi-storage
+
+ Metrics (openshift-infra):
+ - hawkular-cassandra
+ - hawkular-metrics
+ - heapster
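+
+ A quick way to spot unhealthy pods across all namespaces (assuming an admin 'oc' context):
+ oc get pods --all-namespaces -o wide | grep -vE 'Running|Completed'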
+
+
+Debugging
+=========
+ - Ensure system consistency as explained in 'consistency.txt' (incomplete)
+ - Check the current pod logs and, possibly, the logs of the last failed instance
+ oc logs <pod name> --tail=100 [-p]    (dc/<name> or ds/<name> work as well)
+ - Verify initialization steps (check if all volumes are mounted)
+ oc describe <pod name>
+ - It is worth looking at the pod environment
+ oc env po <pod name> --list
+ - It is worth connecting to the running container with an 'rsh' session to inspect running processes,
+ internal logs, etc. A 'debug' session will start a new instance of the pod instead (see the
+ example at the end of this list).
+ - Check whether the corresponding pv/pvc are bound and look at the logs for the pv.
+ * Even if the 'pvc' is bound, the 'pv' may have problems with its backend.
+ * Check the logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/
+ - Another frequent problem is a failing 'postStart' hook or 'livenessProbe'. As the pod
+ immediately crashes, it is not possible to connect to it. Remedies are:
+ * Set a larger initial delay for the probe.
+ * Try to remove the hook and execute it manually in an 'rsh'/'debug' session.
+ - Determine the node running the pod and check the host logs in '/var/log/messages'
+ * Particularly, the logs of 'origin-master-controllers' are of interest
+ - Check which docker images are actually downloaded on the node
+ docker images
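+
+ For reference, the 'rsh'/'debug' sessions mentioned above look like this (names are placeholders):
+ oc -n <namespace> rsh <pod name>        # shell inside the running container
+ oc -n <namespace> debug <pod name>      # new copy of the pod with a shell instead of the entrypoint
+ oc -n <namespace> debug dc/<name>       # the same, but based on the deployment config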
+
+network
+=======
+ - There is a NetworkManager dispatcher script which should adjust /etc/resolv.conf to use the local dnsmasq server.
+ This is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh', which does not play well
+ if OpenShift is running on a non-default network interface. I provided a patched version, but it is
+ worth verifying
+ * that the nameserver points to the host itself (but not to localhost; this is important
+ to allow running pods to use it)
+ * that the correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+ * In some cases, it was necessary to restart dnsmasq (but that could also have been for different reasons)
+ If the script misbehaves, it is possible to call it manually like this:
+ DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
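+
+ A few commands to verify the DNS wiring by hand; this is a sketch which assumes the default
+ cluster domain and that dnsmasq forwards 'cluster.local' to the node DNS as set up by the origin script:
+ cat /etc/resolv.conf                                           # nameserver should be the host IP, not 127.0.0.1
+ cat /etc/dnsmasq.d/origin-upstream-dns.conf                    # upstream nameservers
+ dig +short @<host ip> kubernetes.default.svc.cluster.local     # resolution through the local dnsmasq
+ systemctl restart dnsmasq                                      # if it misbehaves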
+
+
+etcd (and general operability)
+====
+ - A few of these services may seem to be running according to 'systemctl', but actually misbehave. Then, it
+ may be necessary to restart them manually. I have noticed it with
+ * lvm2-lvmetad.socket ('pvscan' will complain about problems)
+ * origin-node
+ * etcd, but BEWARE of too enthusiastic restarting:
+ - Restarting etcd many times is BAD, as it may trigger a severe problem with
+ 'kube-service-catalog/apiserver'. The bug description is here
+ https://github.com/kubernetes/kubernetes/issues/47131
+ - Due to the problem mentioned above, all 'oc' queries become very slow. There is no proper
+ solution suggested, but killing the 'kube-service-catalog/apiserver' pod helps for a while
+ (see the example below). The pod is restarted and response times are back in order.
+ * Another way to see this problem is querying the 'healthz' service, which will report that
+ there are too many clients and ask to retry later.
+ curl -k https://apiserver.kube-service-catalog.svc/healthz
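+
+ A minimal sketch of that remedy (the exact pod name has to be looked up first):
+ oc -n kube-service-catalog get pods | grep apiserver
+ oc -n kube-service-catalog delete pod <apiserver pod>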
+
+ - On node crash, the etcd database may get corrupted.
+ * There is no easy fix. Backup/restore is not working.
+ * The easiest option is to remove the failed etcd member from the cluster.
+ etcdctl3 --endpoints="192.168.213.1:2379" member list
+ etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
+ * Add it to the [new_etcd] section of the inventory and run the openshift-etcd playbook to scale up the etcd cluster (see below).
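+
+ The scale-up itself is done with the standard openshift-ansible playbook, roughly as below
+ (the playbook path depends on the openshift-ansible version):
+ ansible-playbook -i <inventory> playbooks/byo/openshift-etcd/scaleup.yml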
+
+ - There is a health check provided by the cluster
+ curl -k https://apiserver.kube-service-catalog.svc/healthz
+ It may complain about etcd problems. This seems to be triggered by an OpenShift upgrade. The real cause and
+ remedy are unclear, but the installation keeps mostly working. The discussion is in docs/upgrade.txt
+
+ - There is also a different etcd which is an integral part of the ansible service broker:
+ 'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs', it complains
+ about:
+ 2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+ WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
+ Nevertheless, it seems to work without much trouble. The error message seems to be caused by the
+ certificate verification code introduced in etcd 3.2. There are multiple bug reports on
+ the issue.
+
+pods (failed pods, rogue namespaces, etc...)
+====
+ - After crashes / upgrades, some pods may end up in the 'Error' state. This quite often happens to
+ * kube-service-catalog/controller-manager
+ * openshift-template-service-broker/api-server
+ Normally, they should just be deleted. OpenShift will then auto-restart the pods and they will likely run without problems.
+ for name in $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }' ); do oc -n openshift-template-service-broker delete po $name; done
+ for name in $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }' ); do oc -n kube-service-catalog delete po $name; done
+
+ - Other pods will fail with 'ImagePullBackOff' after a cluster crash. The problem is that the ImageStreams populated by 'builds' will
+ not be recreated automatically. By default, the OpenShift docker registry is stored on ephemeral disks and is lost on a crash. The builds should be
+ re-executed manually.
+ oc -n adei start-build adei
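+
+ To see which image streams and builds are affected (the 'adei' namespace is just the example used above):
+ oc -n adei get is          # image streams and whether their tags are populated
+ oc -n adei get builds      # past builds; 'oc start-build' creates a new one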
+
+ - Furthermore, after long outages the CronJobs will stop functioning. The reason can be found by analyzing '/var/log/messages' or, specifically,
+ systemctl status origin-master-controllers
+ it will contain something like:
+ 'Cannot determine if <namespace>/<cronjob> needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.'
+ * The reason is that after 100 missed (or failed) launch periods the controller stops trying, to avoid excessive load. The remedy is to set
+ 'startingDeadlineSeconds', which tells the system how long after the scheduled time a launch may still be attempted. Missed launches are then
+ only counted within this deadline window, so setting it below 100 launch periods (e.g. the 120 seconds used in the patch below) keeps the limit from being reached.
+ https://github.com/kubernetes/kubernetes/issues/45825
+ * The running CronJobs can be easily patched with
+ oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'
+
+ - Sometimes there are rogue namespaces stuck in the 'deleting' state. There are hundreds of possible reasons, but mainly
+ * a crash of both masters during population / destruction of OpenShift resources
+ * running 'oc adm diagnostics'
+ It is unclear how to remove them manually, but it seems that if we run an
+ * OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems).
+ * ... it is not known whether the install playbook, etc., may cause the same trouble...
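+
+ The stuck namespaces can at least be listed easily:
+ oc get namespaces | grep Terminating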
+
+ - There are also rogue pods (mainly due to problems with unmounting lost storage), etc. If 'oc delete' does not
+ work for a long time, it is worth
+ * determining the host running the failed pod with 'oc get pods -o wide'
+ * going to that host, killing the processes, and stopping the container using the docker command
+ * looking in '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
+ - This can be done with 'find . -name heketi*' or something similar...
+ - There could be problematic mounts which can be freed with a lazy umount
+ - The folders of removed pods may (and should) be removed.
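+
+ A hedged sketch of this manual cleanup; all ids and paths are placeholders:
+ oc get pods -o wide --all-namespaces | grep <pod name>                        # find the node
+ docker ps | grep <pod name>                                                   # on that node
+ docker stop <container id> && docker rm <container id>
+ find /var/lib/origin/openshift.local.volumes/pods -maxdepth 1 -name '<pod uid>*'
+ umount -l <stuck mount path>                                                  # lazy umount if the path is busy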
+
+ - Looking into '/var/log/messages', it is sometimes possible to spot various errors like
+ * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
+ The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
+ * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
+ - We can find and remove the corresponding container (the short id is just the first letters of the long id)
+ docker ps -a | grep aa28e9c76
+ docker rm <id>
+ - We can also just destroy all containers which are not running (it will actually try to remove all of them,
+ but only an error message will be printed for the running ones)
+ docker ps -aq --no-trunc | xargs docker rm
+
+
+Storage
+=======
+ - Running a lot of pods may exhaust the available storage. It is worth checking whether
+ * There is enough Docker storage for containers (lvm)
+ * There is enough Heketi storage for dynamic volumes (lvm)
+ * The root file system on nodes still has space for logs, etc.
+ Particularly, there is a big problem with the ansible-run virtual machines. The system disk is stored
+ under '/root/VirtualBox VMs' and, unlike the second hard drive, is not cleaned/destroyed on 'vagrant
+ destroy'. So, it should be cleaned manually.
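+
+ A quick capacity check on a node (the volume group names are installation specific):
+ df -h /                    # root filesystem, logs, etc.
+ vgs                        # free space in the docker / heketi volume groups
+ lvs                        # per-volume usage, including the docker thin pool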
+
+ - Problems with pvc's can be evaluated by running
+ oc -n openshift-ansible-service-broker describe pvc etcd
+ Furthermore, it is worth looking in the folder with volume logs. For each 'pv' it stores subdirectories
+ for the pods executed on this host which mount this pv, and these hold the logs for those pods.
+ /var/lib/origin/plugins/kubernetes.io/glusterfs/
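+
+ A broader look at the volume status (no namespace is needed for 'pv'):
+ oc get pv                                                    # should all be 'Bound'
+ oc get pvc --all-namespaces | grep -v Bound                  # any unbound pvc is suspicious
+ ls /var/lib/origin/plugins/kubernetes.io/glusterfs/          # per-pv log directories mentioned above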
+
+ - Heketi is problematic.
+ * It is worth checking whether the topology is fine and everything is running, e.g. with 'topology info':
+ heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
+ - Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This can
+ be checked with
+ gluster volume status <vol> detail
+ * If not all bricks are online, it is likely enough to just restart the volume
+ gluster volume stop <vol>
+ gluster volume start <vol>
+ * This may break services depending on the provisioned 'pv', like 'openshift-ansible-service-broker/asb-etcd'
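+
+ A small helper to spot offline bricks across all volumes; it has to be run on one of the storage
+ nodes (or inside a glusterfs pod), and the output format may differ between gluster versions:
+ for vol in $(gluster volume list); do
+ echo "== $vol"; gluster volume status "$vol" detail | grep -E 'Brick|Online'
+ done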
+