Document another problem with lost IPs and exhaustion of the SDN IP range

author: Suren A. Chilingaryan <csa@suren.me> 2020-01-22 03:16:06 +0100
committer: Suren A. Chilingaryan <csa@suren.me> 2020-01-22 03:16:06 +0100
commit: 1e8153c2af051ce48d5aa08d3dbdc0d0970ea532 (patch)
tree: 7bb1441a87521aa8c3c5524f95fa645850a6826e /docs/problems.txt
parent: e0b1b53f21095707af87a095934e971d788a90c7 (diff)
download: ands-1e8153c2af051ce48d5aa08d3dbdc0d0970ea532.tar.gz
ands-1e8153c2af051ce48d5aa08d3dbdc0d0970ea532.tar.bz2
ands-1e8153c2af051ce48d5aa08d3dbdc0d0970ea532.tar.xz
ands-1e8153c2af051ce48d5aa08d3dbdc0d0970ea532.zip
1 files changed, 14 insertions, 6 deletions
diff --git a/docs/problems.txt b/docs/problems.txt
index 099193a..3b652ec 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -13,13 +13,14 @@ Client Connection
    box pops up.
 
 
-Rogue network interfaces on OpenVSwitch bridge
-==============================================
+Leaked resources after node termination: Rogue network interfaces on OpenVSwitch bridge, unreclaimed IPs in pod-network, ...
+=======================================
  Sometimes OpenShift fails to clean up properly after a terminated pod. The actual reason is unclear, but
  the problem becomes more severe if an extreme number of images is present in the local Docker storage.
  Several thousand images definitely intensify this problem.
-  * The issue is discussed here:
+  * The issues are discussed here:
         https://bugzilla.redhat.com/show_bug.cgi?id=1518684
+        https://bugzilla.redhat.com/show_bug.cgi?id=1518912
   * They can be detected by inspecting the output of:
     ovs-vsctl show
 
@@ -30,6 +31,12 @@ Rogue network interfaces on OpenVSwitch bridge
   * With time, new rogue interfaces are created faster and faster. At some point, this seriously
   slows down the system and causes pod failures (if many pods are re-scheduled in parallel), even
   if not that many rogue interfaces are present.
+  * Furthermore, a limited range of IPs is allocated for the pod-network on each node. Whether
+  it is caused by the lost bridges or is an unrelated resource-management problem in OpenShift,
+  these IPs also start to leak. As the number of leaked IPs increases, it takes OpenShift longer
+  to find an IP which is still free, and pod scheduling slows down further. At some point, the complete
+  range of IPs will be exhausted and pods will fail to start (after a long wait in the Scheduling state)
+  on the affected node.
   * Even if scheduling does not fail, it takes several minutes to schedule a pod on the affected nodes.
 
  Cause:
@@ -38,7 +45,6 @@ Rogue network interfaces on OpenVSwitch bridge
   * Could be related to the 'container kill failed' problem explained in the section below.
      Cannot kill container ###: rpc error: code = 2 desc = no such process
 
-         
  Solutions:
   * According to RedHat, the temporary solution is to reboot the affected node (in my case, this only temporarily
   reduces the rate at which new spurious interfaces appear and does not prevent the problem completely). The problem
@@ -46,8 +52,10 @@ Rogue network interfaces on OpenVSwitch bridge
   * The simplest work-around is to just remove the rogue interfaces. They will be re-created, but performance
   problems only start after hundreds accumulate.
     ovs-vsctl del-port br0 <iface>
-  * It seems helpful to purge unused docker images to reduce the rate of interface apperance.
-  
+  * Similarly, the unused IPs can be cleaned up in "/var/lib/cni/networks/openshift-sdn": just check with "docker ps"
+  whether the docker container referenced in each IP file is still running. Afterwards, the 'origin-node' service
+  should be restarted.
+  * It also seems helpful to purge unused docker images to reduce the rate of interface appearance.
   
  Status:
    * A cron job is installed which cleans rogue interfaces once their number reaches 25.
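
The IP check described under "Solutions" can be sketched in Python. This is only a minimal dry-run sketch, not part of the commit. It assumes that every file in "/var/lib/cni/networks/openshift-sdn" whose name is an IP address starts with the ID of the container holding that IP. That layout is an assumption here and should be verified on the node first. The helper names running_container_ids and find_stale_ip_files are hypothetical.

```python
import ipaddress
import subprocess
from pathlib import Path

# Directory named in the notes above.
NETWORK_DIR = Path("/var/lib/cni/networks/openshift-sdn")


def running_container_ids():
    """Return the full IDs of the running containers reported by `docker ps`."""
    out = subprocess.run(
        ["docker", "ps", "-q", "--no-trunc"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def find_stale_ip_files(network_dir, running_ids):
    """List IP files whose recorded container ID is not in running_ids.

    Assumption: the first whitespace-separated token of each IP file is
    the ID of the container that holds the IP.
    """
    stale = []
    for path in sorted(Path(network_dir).iterdir()):
        try:
            ipaddress.ip_address(path.name)
        except ValueError:
            continue  # not an IP file (bookkeeping or lock file), skip it
        if not path.is_file():
            continue
        tokens = path.read_text().split()
        container_id = tokens[0] if tokens else ""
        if container_id not in running_ids:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Dry run: only print the candidates, nothing is deleted.
    for path in find_stale_ip_files(NETWORK_DIR, running_container_ids()):
        print(path)
```

The sketch deliberately only lists candidates. Removing the files and restarting the 'origin-node' service afterwards stay manual steps, so the list can be reviewed first.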