author     Suren A. Chilingaryan <csa@suren.me>    2019-10-06 05:00:55 +0200
committer  Suren A. Chilingaryan <csa@suren.me>    2019-10-06 05:00:55 +0200
commit     ba144fab071258a97cf3c42a0defeb0aae41a353 (patch)
tree       2e738d4e4774d754b56d79021cc8781b3c0835a5 /docs/maintenance.txt
parent     efe4b9bbe3c9cb950378de9697eed2030ac49ca2 (diff)
Document latest problems with docker images and resource reclamation, add docker performance checks to the monitoring scripts and helpers to filter the logs
Diffstat (limited to 'docs/maintenance.txt')
 -rw-r--r--   docs/maintenance.txt   55
 1 file changed, 55 insertions, 0 deletions
diff --git a/docs/maintenance.txt b/docs/maintenance.txt
new file mode 100644
index 0000000..9f52e18
--- /dev/null
+++ b/docs/maintenance.txt
@@ -0,0 +1,55 @@

Unused resources
================
 ! Cleaning of images is necessary once the number of resident images grows above 1000. Everything else has not caused problems yet
   and can be ignored unless it blocks other actions (e.g. the clean-up of old images).

 - Deployments. By itself this hasn't caused problems yet, but old versions of 'rc' may block removal of old images, and this may
   have a negative impact on performance.
        oc adm prune deployments --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
        oc adm prune builds --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
   * This, however, does not clean old 'rc' controllers which are allowed by 'revisionHistoryLimit' (and possibly other resources
     as well). A script to clean such controllers, 'prunerc.sh', is included.

 - OpenShift sometimes fails to clean stopped containers. These containers again may block removal of images (and, if they
   accumulate, likely cause Docker performance penalties on their own).
   * The lost containers can be identified by looking into /var/log/messages for lines like
        PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
   * We can find and remove the corresponding container (the short id is just the first letters of the long id)
        docker ps -a | grep aa28e9c76
        docker rm <id>
   * In general, any non-running container which remains in the stopped state for a long time can be considered lost. We can remove
     all of them, or only the ones related to a specific image (if we are cleaning images and something blocks deletion of an old version)
        docker rm $(docker ps -a | grep Exited | grep adei | awk '{ print $1 }')

 - If containers are cleaned manually and/or pod termination is forced, some remnants may be left in
   '/var/lib/origin/openshift.local.volumes/pods' (a clean-up sketch is given after this list).
   * This probably can also happen in other cases. It can be detected by looking in /var/log/messages for something like
        Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
   * If unknown, the location of the pod in question can be found with 'find . -name heketi*' or similar (the container names are
     listed under this subdirectory, so they can be used in the search)...
   * There may be problematic mounts which can be freed with a lazy umount.
   * The folders of removed pods may (and should) be removed.
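   A minimal clean-up sketch combining the points above. This script is not part of the repository; it assumes it is run as root on
   the affected node, that GNU grep (with -P) is available, and that the log and volume paths are the ones mentioned above. Double-check
   that the reported pods are really gone before deleting anything.

        #!/bin/bash
        # Collect pod UIDs reported as orphaned in the system log
        for uid in $(grep -oP 'Orphaned pod "\K[0-9a-f-]+' /var/log/messages | sort -u); do
            dir="/var/lib/origin/openshift.local.volumes/pods/$uid"
            [ -d "$dir" ] || continue                              # nothing left for this pod
            # Lazily unmount anything still mounted below the pod directory
            grep " $dir" /proc/mounts | awk '{print $2}' | while read -r mnt; do
                umount -l "$mnt"
            done
            # Remove the leftover pod directory
            rm -rf "$dir"
        done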
 - Pruning unused images (this is required because, if a large number of images accumulates, additional latencies in communication
   with the docker daemon are introduced, resulting in severe penalties to scheduling performance). The official way to clean unused
   images is
        oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
   * This, however, keeps all images referenced by existing bc, dc, rc, and pods (see above). So, it is worth cleaning OpenShift
     resources before proceeding with images. If images still do not go away, it is also worth trying to clean orphaned containers.
   * Some images may also be orphaned by the OpenShift infrastructure. OpenShift supports 'hard' pruning to handle such images.
        https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
     First check if something needs to be done:
        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=check
     If there are many orphans, hard pruning can be executed. This requires additional permissions for the service account running
     docker-registry
        service_account=$(oc get -n default -o jsonpath=$'system:serviceaccount:{.metadata.namespace}:{.spec.template.spec.serviceAccountName}\n' dc/docker-registry)
        oc adm policy add-cluster-role-to-user system:image-pruner ${service_account}
     and should be done with the docker registry in read-only mode (requires a restart of the default/docker-registry containers)
        oc env -n default dc/docker-registry 'REGISTRY_STORAGE_MAINTENANCE_READONLY={"enabled":true}' # wait until new pods rolled out
        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=delete
        oc env -n default dc/docker-registry REGISTRY_STORAGE_MAINTENANCE_READONLY-

 - Cleaning old images which do not want to go.
   * Investigating image streams and manually deleting the old versions of the images
        oc get is adei -o yaml
        oc delete image sha256:04afd4d4a0481e1510f12d6d071f1dceddef27416eb922cf524a61281257c66e
   * Cleaning old dangling images using docker (on all nodes). This has been tried and, as far as it seems, caused no issues for the
     operation of the cluster (a sketch for running it across all nodes is given below).
        docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
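   A possible sketch for running the dangling-image clean-up on every node from the admin host. It is not part of the cluster tooling;
   it assumes passwordless SSH access as root to the node names reported by 'oc get nodes'.

        #!/bin/bash
        # Remove dangling docker images on all nodes of the cluster
        for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
            echo "Pruning dangling images on $node"
            ssh "$node" 'ids=$(docker images --filter "dangling=true" -q --no-trunc); [ -n "$ids" ] && docker rmi $ids || true'
        done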