Update monitoring scripts to track leftover OpenVSwitch 'veth' interfaces and clean them up periodically to avoid performance degradation, split kickstart

author: Suren A. Chilingaryan <csa@suren.me> 2018-07-05 06:29:09 +0200
committer: Suren A. Chilingaryan <csa@suren.me> 2018-07-05 06:29:09 +0200
commit: 2c3f1522274c09f7cfdb6309adc0719f05c188e9 (patch)
tree: e54e0c26f581543f48e945f186734e4bd9a8f15a /docs/troubleshooting.txt
parent: 8af0865a3a3ef783b36016c17598adc9d932981d (diff)
download: ands-2c3f1522274c09f7cfdb6309adc0719f05c188e9.tar.gz
ands-2c3f1522274c09f7cfdb6309adc0719f05c188e9.tar.bz2
ands-2c3f1522274c09f7cfdb6309adc0719f05c188e9.tar.xz
ands-2c3f1522274c09f7cfdb6309adc0719f05c188e9.zip
1 files changed, 18 insertions, 0 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index ae43c52..9fa6f91 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -134,6 +134,22 @@ etcd (and general operability)
  
 pods (failed pods, rogue namespaces, etc...)
 ====
+ - Pod scheduling may fail on one (or more) of the nodes after a long wait, with 'oc logs' reporting
+ a timeout and 'oc describe' reporting 'failed to create pod sandbox'. This can be caused by a failure to clean up
+ properly after a terminated pod, which leaves rogue network interfaces in the OpenVSwitch fabric.
+  * This can be identified by errors reported by 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log',
+    which may quickly grow beyond 100MB:
+        could not open network device vethb9de241f (No such device)
+  * The work-around is to delete rogue interfaces with 
+        ovs-vsctl del-port br0 <iface>
+    More info:
+        ovs-ofctl -O OpenFlow13 show br0
+        ovs-ofctl -O OpenFlow13 dump-flows br0
+    This does not fix the underlying problem, however: OpenShift will keep abandoning newly created interfaces.
+  * The issue is discussed here:
+        https://bugzilla.redhat.com/show_bug.cgi?id=1518684
+        https://bugzilla.redhat.com/show_bug.cgi?id=1518912
+        
  - After crashes / upgrades some pods may end up in 'Error' state. This quite often happens to
     * kube-service-catalog/controller-manager
     * openshift-template-service-broker/api-server
@@ -185,6 +201,8 @@ pods (failed pods, rogue namespaces, etc...)
                 docker ps -aq --no-trunc | xargs docker rm
 
 
+
+
 Builds
 ======
  - After changing storage for integrated docker registry, it may refuse builds with HTTP error 500. It is necessary
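
The work-around added in the first hunk (find interfaces that 'ovs-vsctl show' reports as missing, then remove them with 'ovs-vsctl del-port br0 <iface>') could be automated roughly as follows. This is only a minimal sketch and not the monitoring script from this commit: the function names, the restriction to names starting with 'veth', and the dry-run default are assumptions made for illustration. The error pattern is taken from the log line quoted in the notes.

```python
import re
import subprocess

# Error text as quoted in the troubleshooting notes. Limiting matches to
# names starting with "veth" is an assumption, made to avoid touching
# other kinds of ports.
STALE_RE = re.compile(r"could not open network device (veth\w+) \(No such device\)")


def find_stale_ports(ovs_show_output):
    """Return the sorted, de-duplicated veth names flagged as missing."""
    return sorted(set(STALE_RE.findall(ovs_show_output)))


def cleanup(bridge="br0", dry_run=True):
    """List (and, if dry_run is False, delete) stale ports on the bridge.

    Hypothetical helper: needs ovs-vsctl on PATH and sufficient privileges.
    """
    out = subprocess.run(["ovs-vsctl", "show"], check=True,
                         capture_output=True, text=True).stdout
    ports = find_stale_ports(out)
    for port in ports:
        cmd = ["ovs-vsctl", "del-port", bridge, port]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return ports
```

Calling cleanup() on an affected node only prints the commands it would run; cleanup(dry_run=False) actually removes the ports. As the notes point out, this treats the symptom only, so it would have to run periodically (for example from cron).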