author     Suren A. Chilingaryan <csa@suren.me>    2019-10-06 05:00:55 +0200
committer  Suren A. Chilingaryan <csa@suren.me>    2019-10-06 05:00:55 +0200
commit     ba144fab071258a97cf3c42a0defeb0aae41a353 (patch)
tree       2e738d4e4774d754b56d79021cc8781b3c0835a5 /docs/maintenance.txt
parent     efe4b9bbe3c9cb950378de9697eed2030ac49ca2 (diff)
Document latest problems with docker images and resource reclamation, add docker performance checks to the monitoring scripts and helpers to filter the logs
Diffstat (limited to 'docs/maintenance.txt')
 -rw-r--r--   docs/maintenance.txt   55
 1 file changed, 55 insertions, 0 deletions
diff --git a/docs/maintenance.txt b/docs/maintenance.txt
new file mode 100644
index 0000000..9f52e18
--- /dev/null
+++ b/docs/maintenance.txt
@@ -0,0 +1,55 @@

Unused resources
================
 ! Cleaning of images is necessary once the number of resident images grows above 1000. Everything else has not caused problems yet
   and can be ignored unless it blocks other actions (e.g. the clean-up of old images).

 - Deployments. By itself this hasn't caused problems yet, but old versions of 'rc' may block removal of old images, and this may
   have a negative impact on performance.
        oc adm prune deployments --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
        oc adm prune builds --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
   * This, however, does not clean old 'rc' controllers which are allowed by 'revisionHistoryLimit' (and possibly other resources
     as well). A script to clean such controllers, 'prunerc.sh', is included.

 - OpenShift sometimes fails to clean stopped containers. These containers again may block removal of images (and, if they
   accumulate, likely cause Docker performance penalties on their own).
   * The lost containers can be identified by looking into /var/log/messages for lines like
        PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
   * We can find and remove the corresponding container (the short id is just the first letters of the long id)
        docker ps -a | grep aa28e9c76
        docker rm <id>
   * In general, any non-running container which remains in the stopped state for a long time can be considered lost. We can remove
     all of them, or only the ones related to a specific image (if we are cleaning images and something blocks deletion of an old version)
        docker rm $(docker ps -a | grep Exited | grep adei | awk '{ print $1 }')

 - If containers are cleaned manually and/or pod termination is forced, some remnants may be left in
   '/var/lib/origin/openshift.local.volumes/pods' (a clean-up sketch is given after this list).
   * This probably can also happen in other cases. It can be detected by looking in /var/log/messages for something like
        Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
   * If unknown, the location of the pod in question can be found with 'find . -name heketi*' or similar (the container names are
     listed under this subdirectory, so they can be used in the search)...
   * There may be problematic mounts which can be freed with a lazy umount.
   * The folders of removed pods may (and should) be removed.
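   A minimal clean-up sketch combining the points above. This script is not part of the repository; it assumes it is run as root on
   the affected node, that GNU grep (with -P) is available, and that the log and volume paths are the ones mentioned above. Double-check
   that the reported pods are really gone before deleting anything.

        #!/bin/bash
        # Collect pod UIDs reported as orphaned in the system log
        for uid in $(grep -oP 'Orphaned pod "\K[0-9a-f-]+' /var/log/messages | sort -u); do
            dir="/var/lib/origin/openshift.local.volumes/pods/$uid"
            [ -d "$dir" ] || continue                              # nothing left for this pod
            # Lazily unmount anything still mounted below the pod directory
            grep " $dir" /proc/mounts | awk '{print $2}' | while read -r mnt; do
                umount -l "$mnt"
            done
            # Remove the leftover pod directory
            rm -rf "$dir"
        done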
 - Pruning unused images (this is required because, if a large number of images accumulates, additional latencies in communication
   with the docker daemon are introduced, resulting in severe penalties to scheduling performance). The official way to clean unused
   images is
        oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
   * This, however, keeps all images referenced by existing bc, dc, rc, and pods (see above). So, it is worth cleaning OpenShift
     resources before proceeding with images. If images still do not go away, it is also worth trying to clean orphaned containers.
   * Some images may also be orphaned by the OpenShift infrastructure. OpenShift supports 'hard' pruning to handle such images.
        https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
     First check if something needs to be done:
        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=check
     If there are many orphans, hard pruning can be executed. This requires additional permissions for the service account running
     docker-registry
        service_account=$(oc get -n default -o jsonpath=$'system:serviceaccount:{.metadata.namespace}:{.spec.template.spec.serviceAccountName}\n' dc/docker-registry)
        oc adm policy add-cluster-role-to-user system:image-pruner ${service_account}
     and should be done with the docker registry in read-only mode (requires a restart of the default/docker-registry containers)
        oc env -n default dc/docker-registry 'REGISTRY_STORAGE_MAINTENANCE_READONLY={"enabled":true}' # wait until new pods rolled out
        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=delete
        oc env -n default dc/docker-registry REGISTRY_STORAGE_MAINTENANCE_READONLY-

 - Cleaning old images which do not want to go.
   * Investigating image streams and manually deleting the old versions of the images
        oc get is adei -o yaml
        oc delete image sha256:04afd4d4a0481e1510f12d6d071f1dceddef27416eb922cf524a61281257c66e
   * Cleaning old dangling images using docker (on all nodes). This has been tried and, as far as it seems, caused no issues for the
     operation of the cluster (a sketch for running it across all nodes is given below).
        docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
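   A possible sketch for running the dangling-image clean-up on every node from the admin host. It is not part of the cluster tooling;
   it assumes passwordless SSH access as root to the node names reported by 'oc get nodes'.

        #!/bin/bash
        # Remove dangling docker images on all nodes of the cluster
        for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
            echo "Pruning dangling images on $node"
            ssh "$node" 'ids=$(docker images --filter "dangling=true" -q --no-trunc); [ -n "$ids" ] && docker rmi $ids || true'
        done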