From ba144fab071258a97cf3c42a0defeb0aae41a353 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Sun, 6 Oct 2019 05:00:55 +0200
Subject: Document latest problems with docker images and resource reclaimation, add docker performance checks in the monitoring scripts, helpers to filter the logs

---
 docs/consistency.txt | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

(limited to 'docs/consistency.txt')

diff --git a/docs/consistency.txt b/docs/consistency.txt
index 91a0ee7..3769a60 100644
--- a/docs/consistency.txt
+++ b/docs/consistency.txt
@@ -9,6 +9,10 @@ General overview
         oc get pvc --all-namespaces -o wide
  - API health check
         curl -k https://apiserver.kube-service-catalog.svc/healthz
+ - Docker status (at each node)
+        docker info
+        * Enough Data and Metadata Space is available
+        * The number of resident images is in check (>500-1000 - bad, >2000-3000 - terrible)
 
 Nodes
 =====
@@ -31,7 +35,7 @@ Storage
 Networking
 ==========
  - Check that correct upstream name servers are listed for both DNSMasq (host) and SkyDNS (pods).
-   If not fix and restart 'origin-node' and 'dnsmasq'.
+   If not fix and restart 'origin-node' and 'dnsmasq' (it happens that DNSMasq is just stuck).
     * '/etc/dnsmasq.d/origin-upstream-dns.conf'
     * '/etc/origin/node/resolv.conf'
 
@@ -46,12 +50,14 @@ Networking
 
  - Ensure, we don't have override of cluster_name to first master (which we do during the provisioning of OpenShift plays)
 
- - Sometimes OpenShift fails to clean-up after terminated pod properly. This causes rogue
-   network interfaces to remain in OpenVSwitch fabric. This can be determined by errors like:
+ - Sometimes OpenShift fails to clean-up after terminated pod properly (this problem is particularly
+   triggered on the systems with huge number of resident docker images). This causes rogue network
+   interfaces to remain in OpenVSwitch fabric. This can be determined by errors like:
         could not open network device vethb9de241f (No such device)
    reported by 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log'
    which may quickly grow over 100MB quickly. If number of rogue interfaces grows too much,
-   the pod scheduling will start time-out on the affected node.
+   the pod scheduling gets even worse (compared to delays caused only be docker images) and
+   will start time-out on the affected node.
     * The work-around is to delete rogue interfaces with
         ovs-vsctl del-port br0
       This does not solve the problem, however. The new interfaces will get abandoned by OpenShift.
-- 
cgit v1.2.3
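
The two checks documented in this patch (resident image count and rogue OpenVSwitch interfaces) could be scripted roughly as below. This is only a sketch and not part of the commit: the helper names 'classify_image_count' and 'list_rogue_ports' are hypothetical, and the cut-offs of 500 and 2000 are an assumption that takes the lower bound of each range quoted in the notes.

```shell
#!/bin/sh
# Sketch only; not taken from the monitoring scripts referenced by the commit.

# classify_image_count N
# Prints ok, bad or terrible for a resident image count.
# Assumed cut-offs: >500 is "bad", >2000 is "terrible" (lower bounds of the
# ">500-1000" and ">2000-3000" ranges given in docs/consistency.txt).
classify_image_count() {
    count=$1
    if [ "$count" -gt 2000 ]; then
        echo terrible
    elif [ "$count" -gt 500 ]; then
        echo bad
    else
        echo ok
    fi
}

# list_rogue_ports
# Reads 'ovs-vsctl show' output (or ovs-vswitchd.log) on stdin and prints the
# unique device names found in errors of the form
#   could not open network device <name> (No such device)
list_rogue_ports() {
    sed -n 's/.*could not open network device \([^ ]*\) (No such device).*/\1/p' | sort -u
}
```

On a node, the image count can be taken from the "Images:" line of 'docker info', for example with "docker info 2>/dev/null | sed -n 's/^ *Images: *//p'", and passed to 'classify_image_count'. Piping 'ovs-vsctl show' into 'list_rogue_ports' gives the interface names that the work-around above would delete; as the notes say, this does not solve the underlying problem.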