

Kubernetes Short Notes (3)


Cluster Maintenance

OS Upgrade

Pod Eviction Timeout

When a node is down for more than 5 minutes (the default pod eviction timeout), the pods on it are terminated; a pod is recreated elsewhere if it belongs to a ReplicaSet

Drain, Cordon, Uncordon

If we're not sure the node will come back online within 5 minutes, we can drain it instead.

After the drained node is upgraded and comes back, it is still unschedulable; uncordon the node to make it schedulable again.

Note that the previous pods won't automatically be rescheduled back onto the node.
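
A typical sequence with kubectl (node names are illustrative):

    # evict the pods and mark the node unschedulable
    kubectl drain node-1 --ignore-daemonsets

    # ... perform the OS upgrade, reboot ...

    # make the node schedulable again
    kubectl uncordon node-1

    # cordon only marks a node unschedulable without evicting existing pods
    kubectl cordon node-2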

Cluster Upgrade

The core control plane components' versions can differ, but must follow certain rules:

  • kube-api is the primary component; no other component's version may be higher than kube-api's
  • the other components can be one or two minor versions lower:
    • kube-api: x
    • controller-manager, kube-scheduler: x, x-1
    • kubelet, kube-proxy: x, x-1, x-2
  • kubectl can be one version higher than kube-api: x+1, x, x-1

Kubernetes supports only the three most recent minor versions. The recommended approach is to upgrade one minor version at a time.

How you upgrade the cluster depends on how it was deployed:

  • cloud provider: a few clicks in the UI
  • kubeadm: using the upgrade subcommand (upgrade kubeadm itself first; see the sketch below)
  • the hard way from scratch: manually upgrade each component yourself
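
For a kubeadm cluster, the master-node sequence looks roughly like this (the target version is illustrative, and the package commands assume a Debian-based distro):

    # upgrade kubeadm itself first
    apt-get update && apt-get install -y kubeadm=1.27.4-00

    # review available target versions, then apply
    kubeadm upgrade plan
    kubeadm upgrade apply v1.27.4

    # afterwards, upgrade the kubelet on each node and restart it
    apt-get install -y kubelet=1.27.4-00
    systemctl daemon-reload && systemctl restart kubelet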

Two major steps:

  1. upgrade the master node: the control plane components go down, so all management functions are unavailable, but the applications deployed on worker nodes keep serving
  2. upgrade the worker nodes, with strategies:
    • upgrade all at once, with downtime
    • upgrade one node at a time
    • create new nodes and move the workloads over, then finally remove the old nodes

When you run a command like kubectl get nodes, the VERSION column indicates the version of the kubelet.

Backup and Restore

Master / Node DR

  • Cordon & drain
  • Provision replacement master / node

ETCD DR

Option: Backup resources

Save the objects as copies by querying the kube-api
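
A minimal sketch (note that kubectl get all covers only a subset of resource kinds):

    kubectl get all --all-namespaces -o yaml > all-resources.yaml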

Option: Backup ETCD

Making copies of the ETCD data directory

Or use the etcdctl command line tool:

  1. Make a snapshot

    Remember to specify the certificate files for authentication
  2. Stop kube-api
  3. Restore the snapshot

    When ETCD restores from a backup, it initializes a new cluster configuration and configures its members as new members of a new cluster. This prevents a new member from accidentally joining an existing cluster.
    For example, when using a snapshot to provision a new etcd cluster for testing purposes, you don't want the members of the new test cluster to accidentally join the production cluster.

  4. Configure etcd.service with the new data directory and the new cluster token

    During a restore, you must provide a new cluster token and the same initial cluster configuration

  5. Restart the ETCD service
  6. Start kube-api
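
A sketch of steps 1 and 3 (endpoints, certificate paths and the data directory are illustrative):

    # 1. take a snapshot, authenticating with the etcd certificates
    ETCDCTL_API=3 etcdctl snapshot save /backup/snapshot.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/etcd/ca.crt \
      --cert=/etc/etcd/etcd-server.crt \
      --key=/etc/etcd/etcd-server.key

    # 3. restore into a new data directory with a new cluster token
    ETCDCTL_API=3 etcdctl snapshot restore /backup/snapshot.db \
      --data-dir=/var/lib/etcd-from-backup \
      --initial-cluster-token=etcd-cluster-1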

Persistent Volume DR

You can't rely on Kubernetes for backing up and restoring persistent volumes.

If you're using cloud-provider-specific persistent volumes such as EBS volumes, Azure managed disks or GCE persistent disks, use the cloud provider's snapshot APIs.


Kubernetes Short Notes (2)


Scheduling

Manual Scheduling

  • Bind the pod to a node with the nodeName property; until it is bound, the pod stays in the Pending state

  • Manual ways to bind:

    • specify spec.nodeName (not updatable after creation)

    • create a Binding object (see the sketch below)
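
A minimal sketch of the first approach (pod and node names are hypothetical):

    # pod.yaml -- apply with: kubectl apply -f pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      nodeName: node-1   # bypasses the scheduler; cannot be changed afterwards
      containers:
      - name: nginx
        image: nginx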

Labeling

Used to group and select objects; for example, in a ReplicaSet config (sketched below):

  • metadata.labels sets the labels of the ReplicaSet itself
  • spec.template.metadata.labels sets the labels of the Pod
  • spec.selector.matchLabels defines how the ReplicaSet discovers its Pods
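
A sketch showing all three places (names and labels are illustrative):

    apiVersion: apps/v1
    kind: ReplicaSet
    metadata:
      name: web-rs
      labels:
        tier: frontend        # labels on the ReplicaSet itself
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web            # how the ReplicaSet discovers its Pods
      template:
        metadata:
          labels:
            app: web          # labels on the Pods; must match the selector
        spec:
          containers:
          - name: nginx
            image: nginx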

Annotations

Used to record other details for integration purposes, e.g. build info or contact details

Restriction

Taint/Toleration

Ensure that pods without a matching toleration cannot be scheduled onto a tainted node

  • Taint the nodes

  • Set the pods' tolerations; a taint carries one of three effects for pods that do not tolerate it:

    • NoSchedule
    • PreferNoSchedule: not guaranteed
    • NoExecute: new pods are not scheduled, existing pods are evicted

Note that the values in toleration fields must use double quotes
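
A sketch of a taint and a matching toleration (node name and key/value are illustrative):

    kubectl taint nodes node-1 app=blue:NoSchedule

and in the pod spec:

    tolerations:
    - key: "app"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"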

Node Selector

Limit the pod to be scheduled onto one kind of node only

  • Label the node
  • Set the pod's nodeSelector

Note that there are no OR or NOT conditions; use node affinity instead
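
A sketch (label key and value are illustrative):

    kubectl label nodes node-1 size=large

and in the pod spec:

    nodeSelector:
      size: large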

Node Affinity

Limit the pod to be scheduled onto one or more particular nodes

  • Label the node
  • Set the pod's nodeAffinity
  • operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
  • 3 types: requiredDuringSchedulingIgnoredDuringExecution, preferredDuringSchedulingIgnoredDuringExecution, and the planned requiredDuringSchedulingRequiredDuringExecution

Combine Taint/Toleration with NodeSelector or NodeAffinity to cover the scenarios that neither mechanism handles alone
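
A sketch of the required type with the In operator, as a pod-spec fragment (labels are illustrative):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: size
              operator: In        # or NotIn, Exists, DoesNotExist, Gt, Lt
              values:
              - large
              - medium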

Resources

Request

  • Scheduling is based on the resource requests
  • By default, k8s assumes a pod requires 0.5 cpu and 256Mi memory

Limit

  • By default, k8s limits a pod to 1 cpu and 512Mi memory
  • When a pod tries to exceed resources beyond its limit:
    • cpu: k8s throttles the cpu; the pod is not killed
    • memory: k8s kills the pod with an OOM error
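
A container-spec fragment with explicit requests and limits (the numbers are illustrative):

    resources:
      requests:
        cpu: 500m          # 0.5 cpu
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi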

Static Pods

Used to create the control plane components themselves (this is how kubeadm deploys them)

Without intervention from the kube-api server, the kubelet can manage a node independently by monitoring manifest files in the file system; it can create, recreate, update and delete Pods (the only object type supported this way)

  • --pod-manifest-path=/etc/kubernetes/manifests
  • --config=kubeconfig.yaml (staticPodPath)

When a static pod is created, the kube-api server only gets a read-only mirror object and cannot update or delete it
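
To find the static pod path on a node, a sketch (the paths are typical for kubeadm clusters but may differ):

    # check which option the kubelet was started with
    ps aux | grep kubelet | grep -oE -- '--pod-manifest-path=[^ ]+|--config=[^ ]+'

    # if --config points to a kubelet config file, look for staticPodPath in it
    grep staticPodPath /var/lib/kubelet/config.yaml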

Multiple Scheduler

  • copy the kube-scheduler config from /etc/kubernetes/manifests
  • rename the scheduler with --scheduler-name
  • if one master node runs multiple schedulers:
    • set --leader-elect=false
  • if multiple masters run multiple schedulers, only one scheduler can be active at a time:
    • set --leader-elect=true
    • set --lock-object-name to differentiate the custom scheduler from the default one
  • specify the scheduler for a pod with schedulerName (see the sketch below)
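
A sketch of assigning a pod to a custom scheduler (the scheduler name is illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: custom-scheduled
    spec:
      schedulerName: my-custom-scheduler   # must match the custom scheduler's --scheduler-name
      containers:
      - name: app
        image: nginx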

Kubernetes Short Notes (1)


Cluster Architecture

Master Node

  • ETCD cluster
  • kube-scheduler
  • kube-controller-manager

These components communicate via kube-api server

Worker Node

  • container runtime engine, e.g. Docker, Rocket, ContainerD
  • kubelet: the agent that runs on the node and listens for instructions from kube-api
  • containers

The services deployed on worker nodes communicate with each other via kube-proxy

Components and Objects

ETCD

  • a distributed reliable key-value store
  • client communications on port 2379
  • server to server on port 2380

kube-api

  • primary management component

  • setup:

    1. with kubeadm:

      • deployed as a pod in the kube-system namespace

      • the manifest is at /etc/kubernetes/manifests/kube-apiserver.yaml

    2. without kubeadm:

      • the options are at /etc/systemd/system/kube-apiserver.service

      • search for the kube-apiserver process on the master node

  • example: applying a deployment using kubectl

    1. the kube-api authenticates the user
    2. validates the HTTP request
    3. the kube-scheduler monitors changes from the kube-api, then:
      • retrieves node information from the kube-api
      • schedules the pod to a node, through the kube-api, to the kubelet
      • updates the pod info in ETCD

kube-controller-manager

  • continuously monitors the state of components
  • the controllers are packaged into a single process called kube-controller-manager, which includes:
    1. deployment-controller, cronjob, service-account-controller …
    2. namespace-controller, job-controller, node-controller …
    3. endpoint-controller, replicaset, replication-controller (replica set) …
  • remediates situations

kube-scheduler

  • decides which pod goes onto which node:
    1. filter nodes
    2. rank nodes

kubelet

  • follows instructions from the kube-scheduler and controls the container runtime engine (e.g. docker) to run or remove containers
  • when deploying a cluster with kubeadm, the kubelet is not installed on worker nodes by default; install it manually

kube-proxy

  • runs on each node in the cluster
  • creates iptables rules on each node to forward traffic heading for a service IP to the IP of an actual pod
  • kubeadm deploys kube-proxy as a daemonset on each node

pod

  • containers are encapsulated into pods
  • a single instance of an application; the smallest object in k8s
  • containers in the same pod share storage and network namespaces, and are created and removed together
  • multi-container pods are a rare use case

ReplicationController

  • apiVersion support in v1
  • the process that monitors the pods
  • maintains HA and the specified number of pods running across all nodes
  • only cares about pods whose RestartPolicy is set to Always
  • scalable and replaceable applications should be managed by a controller
  • use cases: rolling updates, multiple release tracks (multiple replication controllers replicate the same pod using different labels)

ReplicaSets

  • the next generation of ReplicationController
  • apiVersion support in apps/v1
  • enhanced filtering in .spec.selector (the major difference)
  • be aware of non-template pods carrying the same labels
  • using a Deployment as a replacement is recommended; it owns and manages its ReplicaSets

Deployment

  • provides replication via ReplicaSets, plus:
    • rolling updates
    • rollouts
    • pause and resume
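
These rollout features map onto kubectl subcommands, e.g. (the deployment name is illustrative):

    kubectl rollout status deployment/my-app    # watch a rolling update
    kubectl rollout pause deployment/my-app
    kubectl rollout resume deployment/my-app
    kubectl rollout undo deployment/my-app      # roll back to the previous revision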

Namespace

  • namespaces created at cluster creation:

    1. kube-system
    2. kube-public
    3. default

  • each namespace can be assigned a quota of resources

  • a DNS entry in the SERVICE_NAME.NAMESPACE.svc.cluster.local format is automatically created at service creation

    1. cluster.local is the default domain name of the cluster

  • permanently configure the namespace for the current context (see the command below)
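
Switching the default namespace for the current context:

    kubectl config set-context --current --namespace=dev   # 'dev' is illustrative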


Redis



compared to memcached

  • supports persistence
    • RDB
    • AOF
  • supports multiple data types
  • pub/sub

commands

  • redis-cli: command line interface
  • redis-sentinel: cluster managing tool
  • redis-server: runs the server
  • redis-benchmark: stress testing
  • redis-check-aof: check AOF files
  • redis-check-dump: check RDB files

configuration

Use redis.conf. The official Docker redis image does not contain this file; mount one yourself or pass settings as redis-server arguments.

types

  • String: get, set, mget, mset
  • Integer: incr, decr, setbit
  • List: lpush, lrange, lpop
  • Hash Map: hset, hget, hmset, hmget
  • Set: sadd, smembers, sdiff, sinter, sunion
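
A few of these commands in action (keys and values are illustrative):

    redis-cli SET counter 10
    redis-cli INCR counter            # 11
    redis-cli LPUSH queue a b c
    redis-cli LRANGE queue 0 -1       # c, b, a
    redis-cli HSET user name alice
    redis-cli SADD tags redis cache
    redis-cli SMEMBERS tags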

use docker

Before starting

To connect to a container, you need to know its name and port, and be on an associated network to be able to discover the service.

There is no DNS resolution in docker's default bridge network; there you need to specify --link to connect containers, and --link is a legacy feature.

Therefore, creating a user-defined network is recommended; it provides automatic DNS resolution.

Create a bridge network

Run a redis instance in the user-defined network

Run a redis-cli that connects to the redis instance
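
The three steps above, sketched (network and container names are illustrative):

    # create a user-defined bridge network
    docker network create redis-net

    # run a redis instance attached to it
    docker run -d --name my-redis --network redis-net redis

    # run a throwaway redis-cli container; the name my-redis resolves via the network's DNS
    docker run -it --rm --network redis-net redis redis-cli -h my-redis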

Transaction

All commands are executed as a single isolated operation, serialized and executed sequentially.
Atomic: all fail or all succeed.

  • MULTI: opens a transaction, always returns OK
  • EXEC: executes the commands in the transaction
  • DISCARD: flushes the commands and exits the transaction
  • WATCH: check-and-set; if a watched key changes, the transaction is not executed

Errors

  • before EXEC: e.g. a syntax error
  • after EXEC: e.g. a value error

The pipeline discards the transaction automatically if an error occurred while the commands were being queued.
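
A sample redis-cli session; commands return QUEUED until EXEC runs them:

    127.0.0.1:6379> MULTI
    OK
    127.0.0.1:6379> SET balance 100
    QUEUED
    127.0.0.1:6379> INCRBY balance -30
    QUEUED
    127.0.0.1:6379> EXEC
    1) OK
    2) (integer) 70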

… To be continued

Memcached


Stores and retrieves data in memory (not persistent), based on a hash function.

concepts

  • Slab: allocates as many pages as are available

  • Page: a memory area of 1MB by default, containing as many chunks as fit

  • Chunk: the minimum allocated space for a single item

  • LRU: the least-recently-used list

ref: Journey to the centre of memcached

we could say that we would run out of memory when all the available pages are allocated to slabs

memcached is designed to evict old/unused items in order to store new ones

every item operation (get, set, update or remove) requires the item in question to be locked

memcached only tries to remove the first 5 items of the LRU — after that it simply gives up and answers with OOM (out of memory)


IP Subnetting


Something you need to know first: Binary Odometer

    10.1.1.254 + 1 = 10.1.1.255
    10.1.1.255 + 1 = 10.1.2.0
    10.1.2.0 + 1 = 10.1.2.1

in reverse:

    10.1.2.0 - 1 = 10.1.1.255

Example 1

172.16.35.123/20, or 172.16.35.123 with the mask 255.255.240.0

Binary Method


Quick Method

Figure out the subnets:

  1. the network/host split falls in the third octet
  2. subtract: 256 - 240 = 16, which means the networks increment in values of 16: 0, 16, 32, 48…
  3. 35 is in the range between 32 and 48, so 172.16.35.123 is on subnet 172.16.32.0; the next subnet is 172.16.48.0

First subnet = 172.16.32.0

Next subnet = 172.16.48.0

Broadcast address = next subnet - 1 = 172.16.47.255

First host = subnet + 1 = 172.16.32.1

Last host = broadcast - 1 = 172.16.47.254
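
The same result, checked from a shell with Python's stdlib ipaddress module:

    python3 - <<'EOF'
    import ipaddress
    net = ipaddress.ip_interface('172.16.35.123/20').network
    print(net.network_address)    # 172.16.32.0
    print(net.broadcast_address)  # 172.16.47.255
    EOF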

Subnetting

  • Class A subnetting (255.0.0.0) supports 16777214 (2^24 - 2) hosts per network, which is way too many
  • Class B subnetting (255.255.0.0) supports 65534 (2^16 - 2) hosts per network, still way too many
  • Class C subnetting (255.255.255.0) supports 254 (2^8 - 2) hosts; more likely we subnet down to at most 254 hosts or even further

If a subnet only needs 2 hosts (e.g. a point-to-point link), you can subnet with the mask 255.255.255.254, i.e. CIDR /31

Network, host number

  • Subnets: 2^(subnet bits)
  • Hosts: 2^(host bits) - 2
    • one address is reserved as the network address of the subnet
    • one is reserved as the broadcast address

Subnetting, in short:

  1. “stealing” or “taking away” bits from the host portion of an address, and
  2. allocating those bits to the network portion

Example 2

The original network 10.128.192.0/18 needs at least 30 subnets, each with as many hosts as possible


  1. draw the line at /18 to split network and host bits
  2. 2^5 = 32 ≥ 30, so 5 subnet bits are needed; draw the line to split subnet and host bits
  3. the network/subnet portion is 8+8+7=23 bits, the host portion is 32-23=9 bits

  • First subnet: 10.128.192.0/23
  • Second subnet: 10.128.194.0/23
  • Last subnet: 10.128.254.0/23
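
The same split, checked from a shell with Python's stdlib ipaddress module:

    python3 - <<'EOF'
    import ipaddress
    net = ipaddress.ip_network('10.128.192.0/18')
    subnets = list(net.subnets(prefixlen_diff=5))   # 5 extra bits -> 32 subnets of /23
    print(len(subnets))   # 32
    print(subnets[0])     # 10.128.192.0/23
    print(subnets[1])     # 10.128.194.0/23
    print(subnets[-1])    # 10.128.254.0/23
    EOF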