author     Suren A. Chilingaryan <csa@suren.me>      2018-03-18 22:59:31 +0100
committer  Suren A. Chilingaryan <csa@suren.me>      2018-03-18 22:59:31 +0100
commit     47f350bc3aa85a8bd406d95faf084df2abf74ae9 (patch)
tree       72ad1e91bac46d3457f89781dc90f0d6c1c074d5 /docs/troubleshooting.txt
parent     006f333828db373435daa15483d2ab753048f62a (diff)
Second revision: includes hostpath mounts, gluster block storage, kaas apps, etc.
Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r--   docs/troubleshooting.txt   49
1 file changed, 49 insertions, 0 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index b4ac8e7..ef3c206 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -60,6 +60,8 @@ Debugging
     oc logs <pod name> --tail=100 [-p]        - dc/name or ds/name as well
  - Verify initialization steps (check if all volumes are mounted)
     oc describe <pod name>
+ - Security (SCC) problems are visible if the replica controller is queried
+    oc -n adei get rc/mysql-1 -o yaml
  - It is worth looking at the pod environment
     oc env po <pod name> --list
  - It is worth connecting to the running container with an 'rsh' session and looking at the running processes,
@@ -85,6 +87,7 @@ network
    * that the nameserver is pointing to the host itself (but not localhost, this is important
      to allow running pods to use it)
    * that the correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+   * that the correct upstream nameservers are listed in '/etc/origin/node/resolv.conf'
    * In some cases it was necessary to restart dnsmasq (but it could also have been for different reasons)
   If the script misbehaves, it is possible to call it manually like this:
     DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
@@ -96,6 +99,7 @@ etcd (and general operability)
   may need to be restarted manually. I have noticed it with
    * lvm2-lvmetad.socket (pvscan will complain on problems)
    * node-origin
+   * glusterd in a container (just kill the misbehaving pod, it will be recreated)
    * etcd, but BEWARE of too enthusiastic restarting:
     - However, restarting etcd many times is BAD as it may trigger a severe problem with
       'kube-service-catalog/apiserver'. The bug description is here
@@ -181,6 +185,13 @@ pods (failed pods, rogue namespaces, etc...)
     docker ps -aq --no-trunc | xargs docker rm
 
+Builds
+======
+ - After changing the storage for the integrated docker registry, it may refuse builds with HTTP error 500. It is
+   necessary to run:
+     oadm policy reconcile-cluster-roles
+
+
 Storage
 =======
  - Running a lot of pods may exhaust the available storage. It is worth checking if
@@ -208,3 +219,41 @@ Storage
     gluster volume start <vol>
    * This may break services depending on a provisioned 'pv', like 'openshift-ansible-service-broker/asb-etcd'
+ - If something has gone wrong, heketi may end up creating a bunch of new volumes, corrupting its database, and
+   crashing, refusing to start. Here is the recovery procedure.
+   * Sometimes it is still possible to start by setting the 'HEKETI_IGNORE_STALE_OPERATIONS' environment
+     variable on the container.
+       oc -n glusterfs env dc heketi-storage -e HEKETI_IGNORE_STALE_OPERATIONS=true
+   * Even if it works, it does not solve the main issue with corruption. It is necessary to start a
+     debugging pod for heketi (oc debug), export the corrupted database, fix it, and save it back. Having
+     a database backup could save a lot of hassle in finding what is amiss.
+       heketi db export --dbfile heketi.db --jsonfile /tmp/q.json
+       oc cp glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q.json q.json
+       cat q.json | python -m json.tool > q2.json
+       ...... Fixing .....
+       oc cp q2.json glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q2.json
+       heketi db import --dbfile heketi2.db --jsonfile /tmp/q2.json
+       cp heketi2.db /var/lib/heketi/heketi.db
+   * If a bunch of disks was created, there are still various leftovers. First, the Gluster volumes
+     have to be cleaned. The idea is to compare the 'vol_'-prefixed volumes in Heketi and Gluster, and
+     remove the ones not present in Heketi. There is a script for this in 'ands/scripts'.
+   * There are LVM volumes left over from Gluster (or even allocated but never associated with Gluster due to
+     various failures, so this clean-up is worth doing independently). On each node we can easily find the
+     volumes created today
+       lvdisplay -o name,time -S 'time since "2018-03-16"'
+     or, again, we can compare the LVM volumes which are used by Gluster bricks with those which are not. The
+     latter ones should be cleaned up. Again, there is a script for this.
+
+Performance
+===========
+ - To find out if OpenShift restricts the usage of system resources, we can 'rsh' into the container and check
+   the cgroup limits in sysfs:
+     /sys/fs/cgroup/cpuset/cpuset.cpus
+     /sys/fs/cgroup/memory/memory.limit_in_bytes
+
+
+Various
+=======
+ - IPMI may cause problems as well. In particular, the mounted CDROM may start complaining. The easiest fix is
+   just to remove it from the running system with
+     echo 1 > /sys/block/sdd/device/delete
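
A quick way to verify the DNS points from the 'network' hunk above; this is a hedged sketch, 'dig' may not be installed everywhere, the queried service name is only an example, and 'hostname -i' may need to be replaced with the node's actual IP:
    cat /etc/resolv.conf                               # nameserver should be the node's own IP, not 127.0.0.1
    cat /etc/dnsmasq.d/origin-upstream-dns.conf        # upstream nameservers forwarded to by dnsmasq
    cat /etc/origin/node/resolv.conf                   # upstream nameservers used by the node service
    dig @$(hostname -i) kubernetes.default.svc.cluster.local   # cluster names should resolve via the node's dnsmasq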
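
The 'compare vol_ volumes in Heketi and Gluster' clean-up above refers to a script in 'ands/scripts'; the sketch below only illustrates the idea and is not taken from that script (the sed expression assumes the usual 'Id:... Cluster:... Name:vol_...' format of 'heketi-cli volume list'):
    gluster volume list | grep '^vol_' | sort > /tmp/gluster_vols
    heketi-cli volume list | sed -n 's/.*Name:\(vol_[0-9a-f]*\).*/\1/p' | sort > /tmp/heketi_vols
    comm -23 /tmp/gluster_vols /tmp/heketi_vols        # volumes Gluster still has but Heketi no longer tracks
    # review that list manually before removing anything, e.g. with
    #   gluster volume stop <vol> && gluster volume delete <vol>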
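
Likewise, a rough sketch of the 'LVM volumes used by Gluster bricks vs. the rest' comparison; it assumes the Heketi layout where bricks are mounted under /var/lib/heketi/mounts/<vg>/<lv>/brick and logical volumes are named 'brick_<id>', and that the brick host matches 'hostname -s'; the real script is in 'ands/scripts':
    # brick LVs referenced by Gluster volumes on this node
    gluster volume info | awk '/^Brick[0-9]*:/ {print $2}' | grep "^$(hostname -s):" \
        | sed -n 's|.*/mounts/[^/]*/\([^/]*\)/brick$|\1|p' | sort -u > /tmp/used_lvs
    # all brick-style LVs known to LVM on this node
    lvs --noheadings -o lv_name | tr -d ' ' | grep '^brick_' | sort -u > /tmp/all_lvs
    comm -13 /tmp/used_lvs /tmp/all_lvs                # LVs no brick refers to; verify manually before any 'lvremove'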
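
The cgroup limits from the 'Performance' section can also be read without an interactive shell; 'mysql-1-abcde' is just a placeholder pod name:
    oc rsh mysql-1-abcde cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # effective memory limit in bytes
    oc rsh mysql-1-abcde cat /sys/fs/cgroup/cpuset/cpuset.cpus             # CPUs the container is allowed to use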