From f3c41dd13a0a86382b80d564e9de0d6b06fb1dbf Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan" <csa@suren.me>
Date: Sun, 11 Mar 2018 19:56:38 +0100
Subject: Various fixes before moving to hardware installation

---
 docs/troubleshooting.txt | 210 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 docs/troubleshooting.txt

(limited to 'docs/troubleshooting.txt')
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
new file mode 100644
index 0000000..b4ac8e7
--- /dev/null
+++ b/docs/troubleshooting.txt
@@ -0,0 +1,210 @@
+Services that must be running
+-----------------------------
+  Etcd:
+    - etcd 
+
+  Node:
+    - origin-node
+ 
+  Master nodes:
+    - origin-master-api
+    - origin-master-controllers
+    - (the combined 'origin-master' service is not running)
+
+  Required Services:
+    - lvm2-lvmetad.socket 
+    - lvm2-lvmetad.service
+    - docker
+    - NetworkManager
+    - firewalld
+    - dnsmasq
+    - openvswitch
+ 
+  Extra Services:
+    - ssh
+    - ntp
+    - openvpn
+    - ganesha (on master nodes, optional)
+
+Pods that must be running
+-------------------------
+  Kubernetes System
+    - kube-service-catalog/apiserver
+    - kube-service-catalog/controller-manager
+  
+  OpenShift Main Services
+    - default/docker-registry
+    - default/registry-console
+    - default/router (3 replicas)
+    - openshift-template-service-broker/api-server (daemonset, on all nodes)
+
+  OpenShift Secondary Services
+    - openshift-ansible-service-broker/asb
+    - openshift-ansible-service-broker/asb-etcd
+
+  GlusterFS
+    - glusterfs-storage (daemonset, on all storage nodes)
+    - glusterblock-storage-provisioner-dc
+    - heketi-storage
+
+  Metrics (openshift-infra):
+    - hawkular-cassandra
+    - hawkular-metrics
+    - heapster
+    
+
+Debugging
+=========
+ - Ensure system consistency as explained in 'consistency.txt' (incomplete)
+ - Check the current pod logs and possibly the logs of the last failed instance
+        oc logs <pod name> --tail=100 [-p]                  - dc/<name> or ds/<name> work as well
+ - Verify the initialization steps (check that all volumes are mounted)
+        oc describe pod <pod name>
+ - It is worth looking at the pod environment
+        oc env po <pod name> --list
+ - It is worth connecting to the running container with an 'rsh' session to see the running processes,
+ internal logs, etc. The 'debug' session will start a new instance of the pod instead.
+ - Check whether the corresponding pv/pvc are bound. Check the logs for the pv.
+    * Even if the 'pvc' is bound, the 'pv' may have problems with its backend.
+    * Check logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/
+ - Another frequent problem is a failing 'postStart' hook or 'livenessProbe'. As the container
+ crashes immediately, it is not possible to connect to it. Remedies are:
+    * Set a larger initial delay for the probe.
+    * Try to remove the hook and execute it manually using 'rsh'/'debug'
+ - Determine the node running the pod and check the host logs in '/var/log/messages'
+    * The logs of 'origin-master-controllers' are of particular interest
+ - Check which docker images are actually downloaded on the node
+        docker images
+
+Network
+=======
+ - There is a NetworkManager dispatcher script which should adjust '/etc/resolv.conf' to use the local dnsmasq server.
+ It is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh', which does not play well
+ if OpenShift is running on a non-default network interface. I provided a patched version, but it
+ is worth verifying
+    * that the nameserver is pointing to the host itself (but not to localhost; this is important
+    to allow running pods to use it)
+    * that the correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+    * In some cases, it was necessary to restart dnsmasq (but the cause could also have been different)
+ If the script misbehaves, it is possible to call it manually like this
+    DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
+
+
+etcd (and general operability)
+==============================
+ - A few of these services may seem to be running according to 'systemctl', but actually misbehave. Then, it
+ may be necessary to restart them manually. I have noticed it with
+    * lvm2-lvmetad.socket       (pvscan will complain about problems)
+    * origin-node
+    * etcd               but BEWARE of overly enthusiastic restarting:
+ - Restarting etcd many times is BAD, as it may trigger a severe problem with
+ 'kube-service-catalog/apiserver'. The bug description is here
+        https://github.com/kubernetes/kubernetes/issues/47131
+ - Due to the problem mentioned above, all 'oc' queries become very slow. No proper
+ solution has been suggested, but killing the 'kube-service-catalog/apiserver' pod helps for a while:
+ the pod is restarted and response times are back to normal.
+    * Another way to see this problem is to query the 'healthz' service, which will report that
+    there are too many clients and ask to retry later.
+        curl -k https://apiserver.kube-service-catalog.svc/healthz
+
+ - On a node crash, the etcd database may get corrupted.
+    * There is no easy fix. Backup/restore is not working.
+    * The easiest option is to remove the failed etcd member from the cluster.
+        etcdctl3 --endpoints="192.168.213.1:2379" member list
+        etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
+    * Then add it to the [new_etcd] section in the inventory and run openshift-etcd to scale up the etcd cluster.
+ 
+ - There is a health check provided by the cluster
+    curl -k https://apiserver.kube-service-catalog.svc/healthz
+ It may complain about etcd problems. This seems to be triggered by the OpenShift upgrade. The real cause and
+ remedy are unclear, but the installation is mostly working. The discussion is in docs/upgrade.txt
+ 
+ - There is also a different etcd which is an integral part of the ansible service broker:
+ 'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs', it complains
+ about:
+        2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+        WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
+ Nevertheless, it seems to work without much trouble. The error message seems to be caused by the
+ certificate verification code which was introduced in etcd 3.2. There are multiple bug reports on
+ the issue.
+ 
+Pods (failed pods, rogue namespaces, etc.)
+==========================================
+ - After crashes / upgrades some pods may end up in the 'Error' state. This quite often happens to
+    * kube-service-catalog/controller-manager
+    * openshift-template-service-broker/api-server
+ Normally, they should be deleted. Then, OpenShift will auto-restart the pods and they will likely run without problems.
+    for name in  $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }' ); do oc -n openshift-template-service-broker delete po $name; done
+    for name in  $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }' ); do oc -n kube-service-catalog delete po $name; done 
+ 
+ - Other pods will fail with 'ImagePullBackOff' after a cluster crash. The problem is that ImageStreams populated by 'builds' will
+ not be recreated automatically. By default, the OpenShift docker registry is stored on ephemeral disks and is lost on a crash. The build should be
+ re-executed manually.
+        oc -n adei start-build adei
+
+ - Furthermore, after long outages the CronJobs will stop functioning. The reason can be found by analyzing '/var/log/messages' or, more specifically,
+        systemctl status origin-master-controllers
+  It will contain something like:
+        'Cannot determine if <namespace>/<cronjob> needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.'
+  * The reason is that after 100 missed (or failed) launch periods the controller stops trying, to avoid excessive load. The remedy is to set 'startingDeadlineSeconds',
+  which tells the system that if a CronJob has failed to start within the allocated interval, it stops trying until the next start period. Then, the 100 misses are only
+  counted within the specified deadline, i.e. the deadline should be set below 'launch period * 100'.
+        https://github.com/kubernetes/kubernetes/issues/45825
+  * The running CronJobs can be easily patched with
+        oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'
+  
+ - Sometimes there are rogue namespaces stuck in the 'deleting' state. There may be many reasons for this, but mainly
+    * Crash of both masters during population / destruction of OpenShift resources
+    * Running of 'oc adm diagnostics'
+  It is unclear how to remove them manually, but it seems that if we run
+    * OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems).
+    * ... I don't know about install, etc. It may cause trouble...
+
+ - There are also rogue pods (mainly due to problems with unmounting lost storage), etc. If 'oc delete' does not
+ work for a long time, it is worth
+    * Determining the host running the failed pod with 'oc get pods -o wide'
+    * Going to that host, killing the pod's processes, and stopping the container using the docker command
+    * Looking in '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
+        - This can be done with 'find . -name "heketi*"' or something like that...
+        - There could be problematic mounts which can be freed with a lazy umount ('umount -l')
+        - The folders for removed pods may (and should) be removed.
+
+ - Looking into '/var/log/messages', it is sometimes possible to spot various errors like
+    * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
+        The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
+    * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
+        - We can find and remove the corresponding container (the short id is just the first letters of the long id)
+                docker ps -a | grep aa28e9c76
+                docker rm <id>
+        - We can further just destroy all containers which are not running (it will actually try to remove all of them,
+        but only an error message will be printed for the running ones)
+                docker ps -aq --no-trunc | xargs docker rm
+
+
+Storage
+=======
+ - Running a lot of pods may exhaust the available storage. It is worth checking if
+    * There is enough Docker storage for containers (lvm)
+    * There is enough Heketi storage for dynamic volumes (lvm)
+    * The root file system on nodes still has space for logs, etc.
+  In particular, there is a big problem with ansible-run virtual machines. The system disk is stored
+  under '/root/VirtualBox VMs' and, unlike the second hard drive, is not cleaned/destroyed on
+  'vagrant destroy'. So, it should be cleaned manually.
+  
+ - Problems with pvc's can be evaluated by running 
+        oc  -n openshift-ansible-service-broker describe pvc etcd
+   Furthermore, it is worth looking in the folder with volume logs. For each 'pv' it stores subdirectories
+   for the pods executed on this host which mount this volume; they hold the logs for these pods.
+        /var/lib/origin/plugins/kubernetes.io/glusterfs/
+
+ - Heketi is problematic.
+    * It is worth checking that the topology is fine and running.
+        heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
+ - Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This can
+ be checked with 
+        gluster volume status <vol> detail
+    * If not all bricks are online, it is likely enough to just restart the volume
+        gluster volume stop <vol>
+        gluster volume start <vol>
+    * This may break services depending on the provisioned 'pv', like 'openshift-ansible-service-broker/asb-etcd'
+    
-- 
cgit v1.2.3