When you provision a Kubernetes cluster with kubeadm, etcd runs as a static pod and its manifest, etcd.yaml, is located in the /etc/kubernetes/manifests/ directory.
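For reference, here is a trimmed sketch of such a manifest, showing only the parts relevant to this story (flags, certificate settings and the image tag vary by kubeadm release; the tag below is just an example):

apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
  - name: etcd
    image: k8s.gcr.io/etcd:3.4.13-0      # example tag; varies by release
    command:
    - etcd
    - --data-dir=/var/lib/etcd           # where etcd keeps its database
    # ...certificate and networking flags elided...
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
  volumes:
  - hostPath:
      path: /var/lib/etcd                # same path on the host
      type: DirectoryOrCreate
    name: etcd-data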
On a non-production installation, I came across a case where a backup of etcd was taken, a deployment was deleted, and then the backup was restored. Naturally, after editing etcd.yaml so that data-dir pointed to the restored database, the expected result was for the previously deleted deployment to reappear. It did not! Six restores in a row failed to bring it back. Let’s see the steps, as reproduced in a test cluster created to replicate what happened:
First, two deployments were created:
$ kubectl create deployment nginx --image nginx
deployment.apps/nginx created
$ kubectl create deployment httpd --image httpd
deployment.apps/httpd created
$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          4s
nginx-6799fc88d8-xklhw   1/1     Running   0          11s
Next, a snapshot of etcd was taken:
$ kubectl -n kube-system exec -it etcd-ip-10-168-1-35 -- sh -c "ETCDCTL_API=3 \
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key etcdctl --endpoints=https://127.0.0.1:2379 \
snapshot save /var/lib/etcd/snapshot.db "
:
:
{"level":"info","ts":1637177906.1665,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/var/lib/etcd/snapshot.db"}
Snapshot saved at /var/lib/etcd/snapshot.db
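Before relying on a snapshot it is worth verifying it. In this etcd 3.4 line, etcdctl can print the snapshot’s hash, revision and size (newer releases move this functionality to etcdutl):
$ kubectl -n kube-system exec -it etcd-ip-10-168-1-35 -- sh -c "ETCDCTL_API=3 \
etcdctl snapshot status /var/lib/etcd/snapshot.db --write-out=table"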
Oh my god, we deleted an important deployment!
$ kubectl delete deployment nginx
deployment.apps "nginx" deleted
$ kubectl get pod
NAME                     READY   STATUS        RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running       0          53s
nginx-6799fc88d8-xklhw   0/1     Terminating   0          60s
$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          114s
Quick! Bring it back. First let’s restore the snapshot we have, shall we?
$ kubectl -n kube-system exec -it etcd-ip-10-168-1-35 -- sh -c "ETCDCTL_API=3 \
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key etcdctl --endpoints=https://127.0.0.1:2379 \
snapshot restore --data-dir=/var/lib/etcd/restore /var/lib/etcd/snapshot.db "
:
:
{"level":"info","ts":1637178021.3886964,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"/var/lib/etcd/snapshot.db","wal-dir":"/var/lib/etcd/restore/member/wal","data-dir":"/var/lib/etcd/restore","snap-dir":"/var/lib/etcd/restore/member/snap"}
And now just edit /etc/kubernetes/manifests/etcd.yaml so that it points to the restored directory:
- --data-dir=/var/lib/etcd/restore
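No change to the volume definitions is needed, since the existing hostPath mount of /var/lib/etcd already covers the restore subdirectory. kubelet watches the manifest directory and recreates the static pod when the file changes, which can be followed with (expect the API server to refuse connections for a moment while etcd comes back):
$ kubectl -n kube-system get pod etcd-ip-10-168-1-35 -w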
And after kubelet does its thing for a minute or two, it should work, right? No:
$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          11m
This was the situation I was shown and asked to offer an opinion on.
Could there be an issue with etcd?
journalctl -u kubelet | grep etcd
reveals nothing.
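On the node itself one can also confirm that the etcd container is at least up and running (assuming a CRI runtime with crictl installed there):
$ sudo crictl ps | grep etcd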
kubectl -n kube-system logs etcd-ip-10-168-1-35
does not reveal anything either:
:
:
2021-11-17 19:50:24.208303 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-11-17 19:50:34.208063 I | etcdserver/api/etcdhttp: /health OK (status code 200)
But look at this:
$ kubectl -n kube-system logs etcd-ip-10-168-1-35 | grep restore
2021-11-17 19:48:34.261932 W | etcdmain: found invalid file/dir restore under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
2021-11-17 19:48:34.293681 I | mvcc: restore compact to 1121
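The first warning is the giveaway: the running etcd still uses /var/lib/etcd as its data dir and merely stumbled over the restore directory nested inside it. One can double-check on the node which flags the etcd process was actually started with (a generic check, nothing kubeadm-specific); consistent with the log line above, it still shows the old path:
$ ps -ef | grep '[e]tcd' | tr -s ' ' '\n' | grep data-dir
--data-dir=/var/lib/etcd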
So there must be something there that directs etcd to read from /var/lib/etcd and not from /var/lib/etcd/restore. What could it be?
# ls /etc/kubernetes/manifests/
etcd.yaml        httpd.yaml           kube-controller-manager.yaml
etcd.yaml.orig   kube-apiserver.yaml  kube-scheduler.yaml
The person who asked for my opinion had thoughtfully kept a backup of the etcd.yaml file. Only it happened that keeping it in the same directory messed up the setup: kubelet treated etcd.yaml.orig as just another static pod manifest, and since that copy still pointed at --data-dir=/var/lib/etcd, that is the configuration the running etcd ended up with. Look what happens next:
$ sudo rm /etc/kubernetes/manifests/etcd.yaml.orig
$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
httpd-757fb56c8d-vhftq   1/1     Running   0          16m
nginx-6799fc88d8-xklhw   1/1     Running   0          16m
Note that the nginx pod returned with the exact same name as before: the pod object itself was part of the restored etcd state, rather than being recreated by the deployment’s controller.
So the takeaway from this adventure is that kubelet reads all files in /etc/kubernetes/manifests, not only the *.yaml ones, so do not keep older versions of manifests in there, or the results will be unexpected.
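The directory kubelet scans for static pod manifests is set by staticPodPath in its configuration; with kubeadm the default looks like this:
$ grep staticPodPath /var/lib/kubelet/config.yaml
staticPodPath: /etc/kubernetes/manifests
kubelet does skip hidden files (names starting with a dot), but the safer habit is to keep backup copies outside that directory altogether, for example:
$ sudo cp /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/etcd.yaml.orig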