author    Suren A. Chilingaryan <csa@suren.me>  2018-07-05 06:29:09 +0200
committer Suren A. Chilingaryan <csa@suren.me>  2018-07-05 06:29:09 +0200
commit    2c3f1522274c09f7cfdb6309adc0719f05c188e9 (patch)
tree      e54e0c26f581543f48e945f186734e4bd9a8f15a /docs/problems.txt
parent    8af0865a3a3ef783b36016c17598adc9d932981d (diff)
Update monitoring scripts to track leftover OpenVSwitch 'veth' interfaces and clean them up periodically to avoid performance degradation; split kickstart
Diffstat (limited to 'docs/problems.txt')
-rw-r--r--  docs/problems.txt  103
1 file changed, 103 insertions, 0 deletions
diff --git a/docs/problems.txt b/docs/problems.txt
new file mode 100644
index 0000000..4be9dc7
--- /dev/null
+++ b/docs/problems.txt
@@ -0,0 +1,103 @@
+Actions Required
+================
+ * A long-term solution to the 'rogue' interfaces is unclear. It may require an update to OpenShift 3.9 or later.
+   However, the proposed work-around should suffice unless the occurrence rate grows significantly.
+ * All other problems found in the logs can be ignored.
+
+
+Rogue network interfaces on OpenVSwitch bridge
+==============================================
+ Sometimes OpenShift fails to properly clean up after a terminated pod. The actual reason is unclear.
+ * The issue is discussed here:
+ https://bugzilla.redhat.com/show_bug.cgi?id=1518684
+ * And it can be detected by inspecting the output of:
+ ovs-vsctl show
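+
+   A minimal sketch for counting the leftover ports, assuming stale 'veth' ports are reported by 'ovs-vsctl show'
+   with a 'No such device' error (verify the exact wording on the affected node):
+
+      # count OVS ports whose backing network device no longer exists
+      ovs-vsctl show | grep -c 'No such device'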
+
+ Problems:
+ * As the number of rogue interfaces grows, it starts to impact performance. Operations with
+   ovs slow down and at some point the pods scheduled to the affected node fail to start due to
+   timeouts. This is indicated in 'oc describe' as: 'failed to create pod sandbox'
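+
+   A quick way to check whether a stuck pod hits this (pod and project names are placeholders):
+
+      # look for the sandbox creation failure among the pod events
+      oc describe pod <pod-name> -n <project> | grep -i 'failed to create pod sandbox'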
+
+ Cause:
+ * Unclear, but it seems the periodic ADEI cron jobs cause the issue.
+ * Could be related to the 'container kill failed' problem explained in the section below.
+      Cannot kill container ###: rpc error: code = 2 desc = no such process
+
+
+ Solutions:
+ * According to RedHat, the temporary solution is to reboot the affected node (not tested yet). The problem
+   should go away, but may re-appear after a while.
+ * The simplest work-around is to just remove the rogue interfaces (a clean-up sketch follows below). They will
+   be re-created, but the performance problems only start after hundreds accumulate.
+      ovs-vsctl del-port br0 <iface>
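+
+   A sketch of such a clean-up, assuming the bridge is 'br0' and that rogue ports show up in 'ovs-vsctl show'
+   with a "could not open network device ... (No such device)" error (verify the exact pattern on the node):
+
+      # delete every OVS port whose backing 'veth' device no longer exists
+      for iface in $(ovs-vsctl show | sed -n 's/.*could not open network device \(veth[^ ]*\).*/\1/p'); do
+          ovs-vsctl del-port br0 "$iface"
+      done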
+
+ Status:
+ * A cron job is installed which cleans up the rogue interfaces once their number hits 25; a possible shape
+   of the crontab entry is sketched below.
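+
+   A possible shape of such a job (the schedule and the script path are hypothetical placeholders; the actual
+   script shipped with the monitoring scripts may differ):
+
+      # /etc/cron.d/ovs-cleanup: every 10 minutes, clean up once more than 25 rogue ports have accumulated
+      */10 * * * *  root  [ "$(ovs-vsctl show | grep -c 'No such device')" -gt 25 ] && /opt/ands/ovs-cleanup.sh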
+
+
+Orphaning / pod termination problems in the logs
+================================================
+ There are several classes of problems with unknown repercussions reported in the system log. Currently, I
+ don't see any negative side effects, except that some of these issues may trigger the "rogue interfaces" problem.
+
+ ! container kill failed because of 'container not found' or 'no such process': Cannot kill container ###: rpc error: code = 2 desc = no such process
+
+ Despite the error, the containers are actually killed and the pods destroyed. However, this error likely triggers the
+ problem with rogue interfaces staying on the OpenVSwitch bridge.
+
+ Scenario:
+ * Happens with short-lived containers
+
+ - containerd: unable to save f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a starttime: read /proc/81994/stat: no such process
+ containerd: f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a (pid 81994) has become an orphan, killing it
+
+ Scenario:
+ This happens every couple of minutes and is attributed to perfectly alive and running pods.
+ * For instance, ipekatrin1 was complaining about some ADEI pod.
+ * After I removed this pod, it immediately started complaining about the 'glusterfs' replica.
+ * If the 'glusterfs' pod is re-created, the problem persists.
+ * It seems only a single pod is affected at any given moment (at least this was always true
+   on ipekatrin1 & ipekatrin2 while I was researching the problem)
+
+ Relations:
+ * This problem is not aligned with the next 'container not found' problem. That one happens with short-lived containers which
+   actually get destroyed; this one is triggered for persistent containers which keep running. And in fact this problem is
+   triggered significantly more frequently.
+
+ Cause:
+ * Seems related to docker health checks, due to a bug in docker 1.12.x which is resolved in 1.13.0-rc2
+ https://github.com/moby/moby/issues/28336
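+
+   To check which docker version a node actually runs (assuming shell access to the node):
+
+      # print the docker daemon version, e.g. 1.12.6
+      docker version --format '{{.Server.Version}}'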
+
+ Problems:
+ * It seems to cause only excessive logging, according to the discussion in the issue
+
+ Solution: Ignore for now
+ * docker-1.13 had some problems with groups (I don't remember exactly) and it was decided not to run it with the current version of KaaS.
+ * Only update docker after extensive testing on the development cluster, or not at all.
+
+ - W0625 03:49:34.231471 36511 docker_sandbox.go:337] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "...": Unexpected command output nsenter: cannot open /proc/63586/ns/net: No such file or directory
+ - W0630 21:40:20.978177 5552 docker_sandbox.go:337] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "...": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "..."
+ Scenario:
+ * It seems it can be ignored, see the RH bug referenced under Cause.
+ * Happens with short-lived containers (adei cron jobs)
+
+ Relations:
+ * This is also not aligned with the 'container not found' problem. The timestamps in the logs differ significantly.
+ * It is also not aligned with the 'orphan' problem.
+
+ Cause:
+ ? https://bugzilla.redhat.com/show_bug.cgi?id=1434950
+
+ - E0630 14:05:40.304042 5552 glusterfs.go:148] glusterfs: failed to get endpoints adei-cfg[an empty namespace may not be set when a resource name is provided]
+ E0630 14:05:40.304062 5552 reconciler.go:367] Could not construct volume information: MountVolume.NewMounter failed for volume "kubernetes.io/glusterfs/4
+
+ I guess it is some configuration issue. It can probably be ignored (a quick check is sketched after the scenario below).
+
+ Scenario:
+ * Reported on long-running pods with persistent volumes (katrin, adai-db)
+ * Also seems to be an unrelated set of problems.
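+
+ A quick check whether the endpoints object referenced by the glusterfs volume actually exists (the project
+ name is a placeholder and has to be adjusted):
+
+    # verify that the 'adei-cfg' endpoints are present in the expected project
+    oc get endpoints adei-cfg -n <project>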
+
+
+
+
+