2019

Cluster Maintenance

OS Upgrade

Pod Eviction Timeout

When a node is down for more than 5 minutes (the default), its pods are terminated; a pod is recreated only if it belongs to a ReplicaSet

Drain, Cordon, Uncordon

If we’re not sure the node will come back online within 5 minutes, we can drain the node first.

After the drained node is upgraded and comes back, it is still unschedulable; uncordon the node to make it schedulable again.

Note that the previous pods won’t be automatically rescheduled back to the node.

Cluster Upgrade

The core control plane components can run different versions, but they must follow certain rules:

the kube-api is the primary component; none of the other control plane components may run a version higher than the kube-api
the other components can be one or two minor versions lower:
- kube-api: x
- controller-manager, kube-scheduler: x, x-1
- kubelet, kube-proxy: x, x-1, x-2
kubectl can be one minor version higher or lower than the kube-api: x+1, x, x-1
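
The skew rules above can be sketched as a small check. This is only an illustration of the table in these notes: the names ALLOWED_SKEW, minor and skew_ok are hypothetical, and the major version is ignored.

```python
# Allowed minor-version offsets relative to kube-api (x), as listed in the notes.
ALLOWED_SKEW = {
    "kube-controller-manager": range(-1, 1),  # x-1, x
    "kube-scheduler": range(-1, 1),           # x-1, x
    "kubelet": range(-2, 1),                  # x-2, x-1, x
    "kube-proxy": range(-2, 1),               # x-2, x-1, x
    "kubectl": range(-1, 2),                  # x-1, x, x+1
}


def minor(version: str) -> int:
    """Extract the minor number from a version string such as 'v1.13.4'."""
    return int(version.lstrip("v").split(".")[1])


def skew_ok(component: str, version: str, api_version: str) -> bool:
    """Return True if the component version is allowed next to this kube-api version."""
    offset = minor(version) - minor(api_version)
    return offset in ALLOWED_SKEW[component]
```

For example, skew_ok("kubelet", "v1.11.0", "v1.13.4") is allowed (x-2), while the same version for kube-scheduler is not.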

Kubernetes supports only the three most recent minor versions. The recommended approach is to upgrade one minor version at a time.

How you upgrade the cluster depends on how it was deployed:

cloud provider: a few clicks in the UI
kubeadm: use the upgrade command (you should upgrade kubeadm itself first!)
the hard way from scratch: manually upgrade each component yourself

Two major steps:

upgrade the master node: while the control plane components are down, all management functions are unavailable; only the applications already deployed on the worker nodes keep serving
upgrade the worker nodes, using one of these strategies:
- upgrade all at once, with downtime
- upgrade one node at a time
- create new nodes, move the workloads over to them, then remove the old nodes

When you run a command like kubectl get nodes, the VERSION column indicates the version of the kubelet

Backup and Restore

Master / Node DR

Cordon & drain
Provision replacement master / node

ETCD DR

Option: Backup resources

Save a copy of the objects by querying the kube-api

kubectl get all --namespace=default -o yaml > default-deployment-services.yaml

Option: Backup ETCD

Make copies of the ETCD data directory

# etcd.service
ExecStart=/usr/local/bin/etcd \
  --name=${ETCD_NAME} \
  ...
  --data-dir=/var/lib/etcd

Or use the etcd command line tool

Take a snapshot

ETCDCTL_API=3 etcdctl snapshot save etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/etcd-server.crt \
  --key=/etc/etcd/etcd

Remember to specify the certificate files for authentication

Stop kube-api

service kube-apiserver stop

Restore snapshot

ETCDCTL_API=3 etcdctl snapshot restore etcd.db \
  --data-dir=/var/lib/etcd-backup-dir \
  --initial-cluster master-1=https://192.168.5.11:2380,master-2=https://192.168.5.12:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://${INTERNAL_IP}:2380

When ETCD restores from a backup, it initializes a new cluster configuration and configures the ETCD members as new members of a new cluster. This prevents a new member from accidentally joining an existing cluster.
For example, when you use a snapshot to provision a new etcd cluster for testing purposes, you don’t want the members of the new test cluster to accidentally join the production cluster.

Configure the etcd.service with the new data directory and the new cluster token

During a restore, you must provide a new cluster token and the same initial cluster configuration

Restart ETCD service

systemctl daemon-reload
service etcd restart

Start kube-api

service kube-apiserver start

Persistent Volume DR

You can’t rely on Kubernetes to back up and restore persistent volumes.

If you’re using cloud-provider-specific persistent volumes like EBS volumes, Azure managed disks or GCE persistent disks, you should use the cloud provider’s snapshot APIs

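無處可逃 (No Exit)

October 8, 2019
Quote

It meant that back when Darby was still in high school, the cancer cells were already growing inside her mother. When she lied to explain why there was a torn Sears price tag in her purse, the cancer was there; when she took bad ecstasy and drove home groggy at three in the morning with a green glow bracelet on her wrist, and her mother, with tears in her eyes, screamed on the porch that she was a slut, the cancer was there. That invisible thing had been perched on her mother’s shoulder the whole time, eavesdropping on their conversations. Her mother had been slowly dying all along, and neither of them had any idea.

Sandi seemed to enjoy looking after him as if he were a child, as though that gave her a reason to hate him.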

Scheduling

Manual Scheduling

Bind the pod to a node with the nodeName property; until it is bound, the pod stays in the Pending state

Manual ways to bind:

specify spec.nodeName (it cannot be updated after the pod is created)

create the Binding object

curl --header "Content-Type:application/json" \
  --request POST --data '{"apiVersion": "v1", "kind": "Binding" ...}' \
  http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/

Labeling

Used to group and select objects; for example, in a ReplicaSet object’s config:

metadata.labels sets the labels of the ReplicaSet itself
spec.template.metadata.labels sets the labels of the Pod
spec.selector.matchLabels defines how the ReplicaSet discovers its Pods
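
A rough sketch of how matchLabels selection behaves, using plain dictionaries instead of real API objects. The function name matches and the sample pods are made up for illustration; the real selector also supports matchExpressions, which is not modelled here.

```python
def matches(match_labels: dict, pod_labels: dict) -> bool:
    """A pod is selected when it carries every key/value pair in matchLabels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical ReplicaSet showing the three places where labels appear.
replicaset = {
    "metadata": {"name": "web", "labels": {"tier": "frontend"}},  # labels of the ReplicaSet itself
    "spec": {
        "selector": {"matchLabels": {"app": "web"}},              # how pods are discovered
        "template": {"metadata": {"labels": {"app": "web", "env": "dev"}}},  # labels of the pods
    },
}

pods = [
    {"name": "web-1", "labels": {"app": "web", "env": "dev"}},
    {"name": "db-1", "labels": {"app": "db"}},
    {"name": "stray", "labels": {"app": "web"}},  # not created from the template, still selected
]

selector = replicaset["spec"]["selector"]["matchLabels"]
selected = [pod["name"] for pod in pods if matches(selector, pod["labels"])]
```

Here selected contains web-1 and stray: any pod with matching labels is picked up, whether or not it came from the template.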

Annotations

Used to record other details for integration purposes, e.g. build info, contact details

Restriction

Taint/Toleration

Pods without a matching toleration cannot be scheduled onto a tainted node

Taint the nodes

kubectl taint nodes master node-role.kubernetes.io/master:NoSchedule # Taint
kubectl taint nodes master node-role.kubernetes.io/master:NoSchedule- # Untaint

Set the pods’ tolerations; three effects are available for pods that do not tolerate the taint:
- NoSchedule
- PreferNoSchedule: not guaranteed
- NoExecute: new pods are treated as with NoSchedule, existing pods are evicted

Note that the values in the tolerations keys must use double quotes
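
To make the three effects concrete, here is a simplified model of the decision, not the scheduler's actual code. The functions tolerates and outcome and the returned strings are hypothetical, and special cases such as a toleration with an empty key are left out.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified match: same key, compatible effect, and Exists or an equal value."""
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("effect") not in (None, taint["effect"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")


def outcome(tolerations: list, taints: list, running: bool = False) -> str:
    """Describe what happens to a pod on a node carrying the given taints."""
    untolerated = {
        taint["effect"]
        for taint in taints
        if not any(tolerates(tol, taint) for tol in tolerations)
    }
    if "NoExecute" in untolerated:
        return "evicted" if running else "not scheduled"
    if running:
        return "keeps running"  # NoSchedule / PreferNoSchedule do not affect existing pods
    if "NoSchedule" in untolerated:
        return "not scheduled"
    if "PreferNoSchedule" in untolerated:
        return "scheduled only as a last resort"
    return "schedulable"
```

With the master taint from the commands above, a pod without tolerations gets "not scheduled", while a pod that is already running there keeps running.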

Node Selector

Restrict the pod to be scheduled onto one kind of node only

Label the node
Set the nodeSelector

Note that nodeSelector has no OR or NOT conditions; use node affinity instead

Node Affinity

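Restrict the pod to be scheduled onto one or more particular nodes

Label the node
Set the nodeAffinity
operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
3 types: requiredDuringSchedulingIgnoredDuringExecution, preferredDuringSchedulingIgnoredDuringExecution, and the planned requiredDuringSchedulingRequiredDuringExecution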
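
The operators can be illustrated with a small evaluator over node labels. This is a sketch under the assumption of a single nodeSelectorTerm whose expressions are ANDed together; expr_matches and node_matches are hypothetical names, not part of any Kubernetes client.

```python
def expr_matches(expr: dict, labels: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "Gt":
        return key in labels and int(labels[key]) > int(values[0])
    if op == "Lt":
        return key in labels and int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator: {op}")


def node_matches(match_expressions: list, labels: dict) -> bool:
    """All expressions of one term must hold for the node to qualify."""
    return all(expr_matches(expr, labels) for expr in match_expressions)
```

This shows the OR and NOT conditions that nodeSelector lacks: In with several values acts as an OR, and NotIn or DoesNotExist act as a NOT.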

Combine Taint/Toleration with NodeSelector or NodeAffinity to cover these scenarios

Resources

Request

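Scheduling is based on the resource requests
By default, k8s assumes a pod requires 0.5 cpu and 256Mi memory (these defaults apply only when a LimitRange in the namespace defines them)

Limit

By default, k8s limits a pod to 1 cpu and 512Mi memory (again, only when a LimitRange defines these defaults)
When a pod tries to use resources beyond its limit:
- cpu: k8s throttles the cpu; the pod is not killed
- memory: k8s kills the pod with an OOM error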

Static Pods

Used to create the control plane components (e.g. by kubeadm)

Without any intervention from the kube-api server, the kubelet can manage a node independently: it monitors config files in the file system and can create, recreate, update and delete pods, and only pods, from them

--pod-manifest-path=/etc/kubernetes/manifests
--config=kubeconfig.yaml (with staticPodPath set inside that file)

Once a static pod is created, the kube-api only gets a read-only mirror of it and cannot update or delete it

Multiple Scheduler

copy the kube-scheduler config from /etc/kubernetes/manifests
rename the scheduler with --scheduler-name
with one master node running multiple schedulers:
- set --leader-elect=false
with multiple masters running multiple copies of a scheduler, only one copy can be active at a time:
- set --leader-elect=true
- set --lock-object-name to differentiate the custom scheduler from the default one
specify the scheduler for a pod with schedulerName

September 27, 2019 / January 6, 2020
Ops

Cluster Architecture

Master Node

ETCD cluster
kube-scheduler
kube-controller-manager

These components communicate via kube-api server

Worker Node

container runtime engine, e.g. Docker, rkt (Rocket), containerd
kubelet: an agent that runs on the node and listens for instructions from kube-api
containers

The services deployed on worker nodes communicate with each other via kube-proxy

Objectives

ETCD

a distributed reliable key-value store
client communication on port 2379
server-to-server communication on port 2380

kube-api

primary management component
setup:
1. using kubeadm
  - deploys kube-api as a pod in the kube-system namespace
  - the manifest is at /etc/kubernetes/manifests/kube-apiserver.yaml
  - in a non-kubeadm setup, the options are in /etc/systemd/system/kube-apiserver.service
  - search for the kube-apiserver process on the master node:

    ps -aux | grep kube-apiserver
example: apply a deployment using kubectl
1. authenticate the user
2. validate the HTTP request

the kube-scheduler picks up the change from the kube-api, then:

1. retrieves the node information from kube-api
2. schedules the pod to a node, through kube-api to the kubelet
3. updates the pod info in ETCD

kube-controller-manager

continuously monitors the state of components
the controllers are packaged into a single process called kube-controller-manager, which includes:
1. deployment-controller, cronjob, service-account-controller …
2. namespace-controller, job-controller, node-controller …
3. endpoint-controller, replicaset, replication-controller (replica set) …
remediates the situation

kube-scheduler

decide which pod goes to which node
1. filter nodes
2. rank nodes
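
The two phases can be sketched as follows. The node fields, the schedule function and the scoring rule (prefer the node with the most CPU left after placing the pod) are assumptions for illustration; the real kube-scheduler combines many filters and priority functions.

```python
def schedule(pod_request: dict, nodes: list):
    """Pick a node name for the pod, or None if no node can fit it."""
    # 1. filter: drop nodes that cannot satisfy the pod's requests
    feasible = [
        node for node in nodes
        if node["free_cpu"] >= pod_request["cpu"] and node["free_mem"] >= pod_request["mem"]
    ]
    if not feasible:
        return None  # in a real cluster the pod would stay Pending

    # 2. rank: score the remaining nodes and take the best one
    best = max(feasible, key=lambda node: node["free_cpu"] - pod_request["cpu"])
    return best["name"]


nodes = [
    {"name": "node-1", "free_cpu": 4, "free_mem": 8},
    {"name": "node-2", "free_cpu": 12, "free_mem": 16},
    {"name": "node-3", "free_cpu": 16, "free_mem": 32},
]
chosen = schedule({"cpu": 10, "mem": 8}, nodes)
```

node-1 is filtered out for lack of CPU; node-3 ranks above node-2 because it would have 6 CPUs left instead of 2.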

kubelet

follows the instructions that originate from kube-scheduler (delivered via kube-api) to control the container runtime engine (e.g. Docker), which runs or removes containers
when kubeadm is used to deploy the cluster, the kubelet is not installed on the worker nodes by default; it needs to be installed manually

kube-proxy

runs on each node in the cluster
creates iptables rules on each node to forward traffic heading to the IP of a service to the IPs of the actual pods
kubeadm deploys kube-proxy as a DaemonSet on each node

pod

containers are encapsulated into a pod
a pod is a single instance of an application, the smallest object in k8s
containers in the same pod share storage and network namespaces, and are created and removed at the same time
a multi-container pod is a rare use case

ReplicationController

supported in apiVersion v1
the process that monitors the pods
maintains HA and the specified number of pods running across all nodes
only cares about pods whose RestartPolicy is set to Always
scalable and replaceable applications should be managed by the controller
use cases: rolling updates, multiple release tracks (multiple replication controllers replicate the same pod but with different labels)

ReplicaSets

next generation of ReplicationController
supported in apiVersion apps/v1
enhanced filtering in .spec.selector (the major difference)
be aware of non-template pods that have the same labels
using a Deployment instead is recommended; it owns and manages its ReplicaSets

Deployment

provides replication via ReplicaSets, plus:
- rolling update
- rollout
- pause and resume

Namespace

namespaces created at cluster creation:
1. kube-system
2. kube-public
3. default

each namespace can be assigned a quota of resources
a DNS entry in the format SERVICE_NAME.NAMESPACE.svc.cluster.local is automatically created when a service is created
1. cluster.local is the default domain name of the cluster
permanently set the namespace for the current context:

kubectl config set-context $(kubectl config current-context) --namespace=$NAMESPACE

August 23, 2019 / August 27, 2019
Python

Generator as Coroutines

cooperative multitasking (cooperative routines)
concurrent, not parallel (a Python program executes on a single thread)

The way to create coroutines:

generators (asyncio)
native coroutines (using async / await)
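
A minimal side-by-side sketch of the two styles. The names averager and double_later are made up for this example; the generator version is driven by hand with send(), while the native coroutine is driven by asyncio's event loop.

```python
import asyncio


def averager():
    """Generator-based coroutine: receives values through send() and yields the running average."""
    total, count, average = 0.0, 0, None
    while True:
        value = yield average  # pause here until the caller sends a value
        total += value
        count += 1
        average = total / count


async def double_later(x):
    """Native coroutine: await hands control back to the event loop voluntarily."""
    await asyncio.sleep(0)
    return x * 2


gen = averager()
next(gen)                  # prime the generator: run it up to the first yield
first = gen.send(10)       # 10.0
second = gen.send(20)      # 15.0
result = asyncio.run(double_later(21))  # 42
```

In both cases the task decides where it gives up control (at yield or at await), which is what makes the multitasking cooperative.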

Concepts

concurrency: tasks start, run and complete in overlapping time periods
parallelism: tasks run simultaneously

cooperative: control is relinquished to another task voluntarily, controlled by the application (developer)
preemptive: control is relinquished to another task involuntarily, controlled by the OS
some sort of scheduler is involved

Global Interpreter Lock (GIL)
Only one native thread executes at a time.

Use process-based parallelism to avoid the GIL, not thread-based.

The Python threading module uses threads instead of processes. Threads run in the same shared memory heap, whereas processes run in separate memory heaps. This makes sharing information and object instances harder with processes. Because threads share a memory heap, multiple threads can write to the same memory location at the same time, which is why the global interpreter lock (GIL) in CPython was created as a mutex to prevent that from happening.

Make the right choice

CPU Bound => Multi processing
I/O Bound, Fast I/O, Limited Connections => Multi Threading
I/O Bound, Slow I/O, Many Connections => Concurrency

Use deque

A much more efficient way to implement a stack or a queue.

Average over 1,000 runs of each operation on 10,000 items:

(times in seconds)	list	deque
append(right)	0.87	0.87
pop(right)	0.002	0.0005
insert(left)	20.8	0.84
pop(left)	0.012	0.0005

Create an unbounded deque with deque() or deque(iterable)
Create a bounded deque with deque(maxlen=n). If it is full, adding new items discards a corresponding number of items from the opposite end.
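
A short illustration of the stack, queue and bounded cases using collections.deque; the variable names are arbitrary.

```python
from collections import deque

# Stack: push and pop on the right end.
stack = deque()
stack.append(1)
stack.append(2)
top = stack.pop()            # 2

# Queue: append on the right, pop from the left.
queue = deque([1, 2, 3])
queue.append(4)
first = queue.popleft()      # 1

# Bounded deque: once full, new items push old ones out of the opposite end.
recent = deque(maxlen=3)
for i in range(5):
    recent.append(i)         # 0 and 1 fall off the left
# recent is now deque([2, 3, 4], maxlen=3)
recent.appendleft(9)         # 4 falls off the right
# recent is now deque([9, 2, 3], maxlen=3)
```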

Implement producer / consumer coroutine using deque
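
One possible sketch, assuming generator-based coroutines that share a bounded deque as the buffer. The names produce, consume and run are hypothetical: the producer yields when the buffer is full, and the consumer yields when it is empty.

```python
from collections import deque


def produce(buffer, items):
    """Fill the buffer, handing control back whenever it is full."""
    for item in items:
        buffer.appendleft(item)
        if len(buffer) == buffer.maxlen:
            yield  # buffer full: let the consumer run


def consume(buffer, results):
    """Drain the buffer, handing control back whenever it is empty."""
    while True:
        while buffer:
            results.append(buffer.pop())  # oldest item sits on the right
        yield  # buffer empty: let the producer run


def run(items, size=3):
    """Alternate between producer and consumer until all items are consumed."""
    buffer, results = deque(maxlen=size), []
    producer, consumer = produce(buffer, items), consume(buffer, results)
    while True:
        try:
            next(producer)
        except StopIteration:
            next(consumer)  # drain whatever is left
            return results
        next(consumer)
```

Because the producer yields exactly when the buffer reaches maxlen, no item is ever discarded by the bounded deque.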

Implement simple event loop
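
A minimal round-robin loop over generators, as a sketch only. The functions countdown and run_all are made-up names, and real event loops (such as asyncio's) also handle I/O readiness and timers, which are left out here.

```python
from collections import deque


def countdown(name, n, log):
    """A toy task: records one step, then yields control back to the loop."""
    while n > 0:
        log.append(f"{name}:{n}")
        n -= 1
        yield


def run_all(tasks):
    """Run the tasks round-robin until every one of them has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished: do not re-queue it
        ready.append(task)      # task paused: put it at the back of the queue


log = []
run_all([countdown("a", 2, log), countdown("b", 1, log)])
# log is now ["a:2", "b:1", "a:1"]
```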

Context Manager

August 20, 2019 / August 23, 2019
Python

Context Manager

what is context

the state surrounding a section of code

why we need a context manager

writing try/finally every time can get cumbersome
it is easy to forget to close the file

use cases

Useful for programs that need Enter / Exit handling

creating / releasing resources
database transactions
setting and resetting the decimal context

Common patterns

open / close
lock / release
change / reset
start / stop
enter / exit

protocol

implement these two dunder methods:

__enter__
performs the setup, optionally returns an object
__exit__
receives the exception, if any, and decides whether to silence or propagate it
- takes the arguments exc_type, exc_value, exc_trace to handle the exception
- return True to silence the exception
performs the clean up

examples
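
A small class implementing the protocol, as an illustration. Resource and its log list are hypothetical; silencing only ValueError is an arbitrary choice to show both outcomes of __exit__.

```python
class Resource:
    """Records when it is opened and closed; silences ValueError only."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __enter__(self):
        self.log.append(f"open {self.name}")   # setup
        return self                            # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, exc_trace):
        self.log.append(f"close {self.name}")  # clean up, runs even on error
        # Returning True silences the exception; anything falsy propagates it.
        return exc_type is not None and issubclass(exc_type, ValueError)


log = []
with Resource("db", log) as res:
    raise ValueError("silenced by __exit__")
# execution continues here; log is ["open db", "close db"]
```

Any other exception type, such as KeyError, would still propagate after the clean up has run.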

contextlib
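
With contextlib.contextmanager, a generator can stand in for the two dunder methods. The sketch below covers the "set and reset decimal context" use case; the helper name precision is made up (the standard library already offers decimal.localcontext for the same purpose).

```python
import decimal
from contextlib import contextmanager


@contextmanager
def precision(prec):
    """Temporarily change the decimal precision, restoring it on exit."""
    ctx = decimal.getcontext()
    old = ctx.prec
    ctx.prec = prec          # code before yield plays the role of __enter__
    try:
        yield ctx
    finally:
        ctx.prec = old       # code after yield plays the role of __exit__


before = decimal.getcontext().prec
with precision(3):
    third = decimal.Decimal(1) / decimal.Decimal(3)   # Decimal('0.333')
after = decimal.getcontext().prec                     # same as before
```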

nested contexts
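
One way to nest a variable number of contexts is contextlib.ExitStack. In this sketch, step and run_nested are hypothetical helpers that record the order in which contexts are entered and exited.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def step(name, log):
    """A toy context that records when it is entered and exited."""
    log.append(f"enter {name}")
    try:
        yield name
    finally:
        log.append(f"exit {name}")


def run_nested(names):
    """Enter one context per name, then exit them in reverse order."""
    log = []
    with ExitStack() as stack:
        for name in names:
            stack.enter_context(step(name, log))
        log.append("body")
    return log


order = run_nested(["a", "b"])
# order is ["enter a", "enter b", "body", "exit b", "exit a"]
```

The contexts are exited last-in, first-out, the same order as writing nested with blocks by hand.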
