Kubernetes Short Notes (3)
Cluster Maintenance
OS Upgrade
Pod Eviction Timeout
When a node is down for more than 5 minutes (the default pod eviction timeout), the pods on it are terminated; a pod will be recreated on another node if it belongs to a ReplicaSet.
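This timeout is controlled by the `--pod-eviction-timeout` flag on the kube-controller-manager; a minimal sketch of inspecting and overriding it, assuming a kubeadm-style cluster where the controller manager runs as a static pod:

```
# check the current value (defaults to 5m0s if the flag is absent)
ps aux | grep kube-controller-manager

# to override, add the flag to the controller manager manifest, e.g.
# in /etc/kubernetes/manifests/kube-controller-manager.yaml:
#   - --pod-eviction-timeout=10m0s
```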
Drain, Cordon, Uncordon
We’re not sure the node will come back online within 5 minutes, so we can drain the node before the maintenance.
After the drained node is upgraded and comes back, it is still unschedulable; uncordon the node to make it schedulable again.
Note that the previous pods won’t automatically be rescheduled back onto the node.
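The corresponding commands (the node name `node-1` is an example):

```
kubectl drain node-1 --ignore-daemonsets   # evict the pods and mark the node unschedulable
kubectl cordon node-1                      # only mark unschedulable, don't evict anything
kubectl uncordon node-1                    # mark schedulable again after the upgrade
```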
Cluster Upgrade
The core control plane components can run at different versions, but they should follow certain rules:
- the kube-api is the primary component; no other component’s version may be higher than the kube-api’s
- the other components can be one or two minor versions lower:
- kube-api: x
- controller-manager, kube-scheduler: x, x-1
- kubelet, kube-proxy: x, x-1, x-2
- kubectl can be one version higher than kube-api: x+1, x, x-1
Kubernetes supports only the three most recent minor versions. The recommended approach is to upgrade one minor version at a time.
How you upgrade the cluster depends on how you deployed it:
- cloud provider: a few clicks in the UI
- kubeadm: using the `kubeadm upgrade` commands (you should upgrade kubeadm itself first!), as sketched below
- the hard way (deployed from scratch): manually upgrade each component yourself
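A sketch of the kubeadm flow on the master node (the v1.12.0 target and the apt package versions are examples; adjust for your distro and target version):

```
apt-get upgrade -y kubeadm=1.12.0-00   # upgrade kubeadm itself first
kubeadm upgrade plan                   # show current/target versions and what will be upgraded
kubeadm upgrade apply v1.12.0          # upgrade the control plane components
apt-get upgrade -y kubelet=1.12.0-00   # the kubelet on the master is upgraded separately
systemctl restart kubelet
```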
Two major steps:
- upgrade the master node: the control plane components go down, so all management functions are unavailable, but the applications deployed on the worker nodes keep serving
- upgrade the worker nodes, with one of these strategies:
- upgrade all at once, with downtime
- upgrade one at a time (see the sketch after this list)
- create new nodes and move the workloads over, then finally remove the old nodes
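For the one-at-a-time strategy, a sketch per worker node (the node name and versions are examples; `kubeadm upgrade node config` is the kubeadm v1.12-era syntax):

```
kubectl drain node-1 --ignore-daemonsets                # evict workloads and cordon the node
apt-get upgrade -y kubeadm=1.12.0-00 kubelet=1.12.0-00
kubeadm upgrade node config --kubelet-version v1.12.0   # update the node's kubelet config
systemctl restart kubelet
kubectl uncordon node-1                                 # make the node schedulable again
```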
When you run a command like `kubectl get nodes`, the VERSION column indicates the version of the kubelet on each node.
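For example (illustrative output; here the kubelet on node-1 lags one minor version behind):

```
$ kubectl get nodes
NAME       STATUS   ROLES    AGE   VERSION
master-1   Ready    master   10d   v1.12.0
node-1     Ready    <none>   10d   v1.11.3
```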
Backup and Restore
Master / Node DR
- Cordon & drain
- Provision replacement master / node
ETCD DR
Option: Backup resources
Save the objects as a copy by querying the kube-api:
```
kubectl get all --namespace=default -o yaml > default-deployment-services.yaml
```
Option: Backup ETCD
Making copies of the ETCD data directory
```
# etcd.service
ExecStart=/usr/local/bin/etcd \
  --name=${ETCD_NAME} \
  ...
  --data-dir=/var/lib/etcd
```
Or use the `etcdctl` command line tool:
- Make a snapshot
```
ETCDCTL_API=3 etcdctl snapshot save etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/etcd-server.crt \
  --key=/etc/etcd/etcd-server.key
```
Remember to specify the certificate files for authentication.
- Stop kube-apiserver
```
service kube-apiserver stop
```
- Restore the snapshot
```
ETCDCTL_API=3 etcdctl snapshot restore etcd.db \
  --data-dir=/var/lib/etcd-backup-dir \
  --initial-cluster master-1=https://192.168.5.11:2380,master-2=https://192.168.5.12:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://${INTERNAL_IP}:2380
```
When ETCD restores from a backup, it initializes a new cluster configuration and configures its members as new members of a new cluster. This is to prevent a new member from accidentally joining an existing cluster.
For example, when using a snapshot to provision a new etcd cluster for testing purposes, you don’t want the members of the new test cluster to accidentally join the production cluster.
- Configure the etcd.service with the new data directory and the new cluster token
During a restore, you must provide a new cluster token and the same initial cluster configuration
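A sketch of the updated unit, reusing the values from the restore command above:

```
# etcd.service
ExecStart=/usr/local/bin/etcd \
  --name=${ETCD_NAME} \
  ...
  --initial-cluster-token etcd-cluster-1 \
  --data-dir=/var/lib/etcd-backup-dir
```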
- Restart ETCD service
```
systemctl daemon-reload
service etcd restart
```
- Start kube-apiserver
```
service kube-apiserver start
```
Persistent Volume DR
You can’t rely on Kubernetes for backing up and restoring persistent volumes.
If you’re using cloud-provider-specific persistent volumes like EBS volumes, Azure Managed Disks, or GCE persistent disks, you should use the cloud provider’s snapshot APIs.
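For example, snapshotting an EBS volume with the AWS CLI (the volume ID is hypothetical):

```
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "backup of the volume backing my PersistentVolume"
```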