The case of an etcd restore that was not happening

When you provision a Kubernetes cluster with kubeadm, etcd runs as a static pod and its manifest, etcd.yaml, lives in the /etc/kubernetes/manifests/ directory.
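If you want to confirm where kubelet looks for static pod manifests on your node, the kubeadm-generated kubelet configuration (typically /var/lib/kubelet/config.yaml) spells it out; something along these lines should work:

$ sudo grep staticPodPath /var/lib/kubelet/config.yaml
staticPodPath: /etc/kubernetes/manifests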

In a non-production installation, I came across a case where a backup of etcd was taken, a deployment was deleted, and then the backup was restored. Naturally, the expected result after editing etcd.yaml so that data-dir pointed to the restored database was for the previously deleted deployment to reappear. It did not! Six restores in a row did not bring it back. Let's walk through the steps taken in a test cluster created to replicate what happened:

First, two deployments were created:

$ kubectl create deployment nginx --image nginx
deployment.apps/nginx created

$ kubectl create deployment httpd --image httpd
deployment.apps/httpd created

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          4s
nginx-6799fc88d8-xklhw   1/1     Running   0          11s

Next, a snapshot of etcd was taken:

$ kubectl -n kube-system exec -it etcd-ip-10-168-1-35 -- sh -c "ETCDCTL_API=3 \
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key etcdctl --endpoints=https://127.0.0.1:2379 \
snapshot save /var/lib/etcd/snapshot.db "
:
:
{"level":"info","ts":1637177906.1665,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/var/lib/etcd/snapshot.db"}
Snapshot saved at /var/lib/etcd/snapshot.db
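As a quick sanity check, etcdctl can report basic statistics about the snapshot file (no certificates are needed here, since it only reads the file); a command along these lines should do it:

$ kubectl -n kube-system exec -it etcd-ip-10-168-1-35 -- sh -c "ETCDCTL_API=3 \
etcdctl snapshot status /var/lib/etcd/snapshot.db --write-out=table"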

Oh my god, we deleted an important deployment!

$ kubectl delete deployment nginx 
deployment.apps "nginx" deleted

$ kubectl get pod
NAME                     READY   STATUS        RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running       0          53s
nginx-6799fc88d8-xklhw   0/1     Terminating   0          60s

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          114s

Quick! Bring it back. First let’s restore the snapshot we have, shall we?

$ kubectl -n kube-system exec -it etcd-ip-10-168-1-35 -- sh -c "ETCDCTL_API=3 \
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key etcdctl --endpoints=https://127.0.0.1:2379 \
snapshot restore --data-dir=/var/lib/etcd/restore /var/lib/etcd/snapshot.db "
:
:
{"level":"info","ts":1637178021.3886964,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"/var/lib/etcd/snapshot.db","wal-dir":"/var/lib/etcd/restore/member/wal","data-dir":"/var/lib/etcd/restore","snap-dir":"/var/lib/etcd/restore/member/snap"}

And now just edit /etc/kubernetes/manifests/etcd.yaml so that it points to the restored directory:

- --data-dir=/var/lib/etcd/restore
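For context, the relevant pieces of a kubeadm-generated etcd.yaml look roughly like the excerpt below (details vary between versions). Because the restored directory sits underneath /var/lib/etcd, which is already mounted into the pod via a hostPath volume, only the --data-dir flag needs to change:

spec:
  containers:
  - command:
    - etcd
    - --data-dir=/var/lib/etcd/restore   # changed from /var/lib/etcd
    # (other flags omitted)
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data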

And after kubelet does its thing for a minute or two, it should work, right? No:

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          11m

This was the situation I was pointed to and asked to offer an opinion on.

Could there be an issue with etcd?

journalctl -u kubelet | grep etcd reveals nothing.

kubectl -n kube-system logs etcd-ip-10-168-1-35 does not show anything suspicious either:

:
:
2021-11-17 19:50:24.208303 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-11-17 19:50:34.208063 I | etcdserver/api/etcdhttp: /health OK (status code 200)

But look at this:

$ kubectl -n kube-system logs etcd-ip-10-168-1-35 | grep restore
2021-11-17 19:48:34.261932 W | etcdmain: found invalid file/dir restore under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
2021-11-17 19:48:34.293681 I | mvcc: restore compact to 1121

So something must still be directing etcd to read from /var/lib/etcd and not from /var/lib/etcd/restore. What could it be?

# ls /etc/kubernetes/manifests/
etcd.yaml       httpd.yaml           kube-controller-manager.yaml
etcd.yaml.orig  kube-apiserver.yaml  kube-scheduler.yaml

The person who asked for my opinion had thoughtfully kept a backup of the etcd.yaml file. Only it happened that keeping it in the same directory messed up the setup. Look what happens next:

$ sudo rm /etc/kubernetes/manifests/etcd.yaml.orig 

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          16m
nginx-6799fc88d8-xklhw   1/1     Running   0          16m


Note that the nginx pod returned with the exact same name as before.

So the takeaway from this adventure is that kubelet reads all files in /etc/kubernetes/manifests, not only the *.yaml ones. Do not keep older versions of manifests in there, or the results will be unexpected.
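If you do want to keep a copy of etcd.yaml before editing it, one safe approach is to park it in a directory that kubelet does not watch; the backup path below is just an illustration:

$ sudo mkdir -p /root/manifest-backups
$ sudo cp /etc/kubernetes/manifests/etcd.yaml /root/manifest-backups/etcd.yaml.orig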
