Kubernetes Short Notes(4)

Storage

Persistent Volume

Except storing a volume on the host, Kubernetes provide serveral type of storage solutions

  • NFS
  • GlusterFS
  • Flocker
  • Ceph
  • ScaleIO
  • AWS EBS
  • Azure Disk
  • Google Persistent Disk

Persistent Volume Claim

Administrators creates PV, and users creates PVC to use the PV, during the binding process Kubernetes tries to find the PV that has sufficient capacity as requested by the claim and any other request properties such as access modes, volume modes, storage class and selector

Note that a smaller claim may get bound to a larger volume if all the other criteria matches and there are no better options

There is a one to one relationship bewteen PV and PVC, no other claims can utilize the remaining capacity in the volume

Configure the field persistentVolumeReclaimPolicy to defined what action to perform to PV after a PVC deleted.

  • Retain (default)
  • Delete
  • Recycle

Networking

Networking for Linux Basics

Network Switch

A switch is a device in a computer network that connects other devices together, can only enable a communication within a network

Host A(192.168.1.10)[eth0] &harr Switch(192.168.1.0) &harr [eth0]Host B(192.168.1.11)

Router

A router is a device/service that provides the function of routing IP packets between networks

Switch(192.168.1.0) <–> [192.168.1.1]Router[192.168.2.1] <–> Switch(192.168.2.0)

Route/Gateway

A gateway (in network terms) is a router that describes the function for connectivity

Default Gateway

If none of these forwarding rules in the routing table is appropriate for a given destination address, the default gateway is chosen as the default router of last resort

Forwording packets between interfaces

By default in linux, packets are not forwarded from one interface to the next, for security reasons

Explicity allow it

Persists the settings

DNS

Translate host name to IP address by configure the /etc/hosts

When a environment has too many entries and IP address are not persistent, we need a DNS server

The host will lookup an entry in /etc/hosts first, then lookup in the DNS. This order can be changed by configure file /etc/nsswitch.conf

You can configure the DNS server to forward unknown host name to the public name server in the Internet, for example reach www.google.com

private DNS → Root DNS → .com DNS → google DNS → cache the result

When looking for a host in the same domain, we want to simple use the host name not the full name, such as using web not web.mycompany.com, therefore we specify the domain name you want to append in /etc/resolv.conf

There are records stores in DNS with specific types:

  • A: ipv4
  • AAAA: ipv6
  • CNAME: name to name mapping

You can use tools like nslookup, dig to debug, note that nslookup only query from dns, not files

There are plenty DNS solutions, such as CoreDNS, except configure from files, CoreDNS supports other ways of configuring DNS entries through plugins like kubernetes

Network Namespace

A namespace is a way of scoping a particular set of identifiers

Linux provides namespaces for networking and processes, if a process is running within a process namespace, it can only see and communicate with other processes in the same namespace

Linux starts up with a default network namespace

Each network namespace has its own routing table and has its own set of iptables

Connect namespaces together using a virtual Ethernet pair (or virtual cable, pipe)

When there more of namespaces need connected, use a virtial switch to create a virtial network. There few solutions:

  • Linux Bridge
  • Open vSwitch

image

When a private virtual network need to reach the outer network, it need a gateway, the host is the gateway

For destination network to response, enable NAT on host acting as a gateway.

Add a new rule in the NAT IP table in the POSTROUTING chain to masquerade or replace the from address on all packets coming from the source network 192.168.15.0 with its own IP address.

Thus anyone receiving these packets outside the network will think that they are coming from the host and not from within the namespaces

Add a route using default gateway to outside world

For outside world to reach the namespace in a private network, add a port forwarding rule using IP tables to say any traffic coming to port 80 on the localhost is to be forwarded to port 80 on the IP assigned to the namespace

Read more “Kubernetes Short Notes(4)”

發表留言

Kubernetes Short Notes(3)

Cluster Maintainance

OS Upgrade

Pod Eviction Timeout

When the nodes was down for more than 5 minute(default) then the pods are terminated; pod will recreate if has replicaset

Drain, Cordon, Uncordon

We’re not sure the node will come back online in 5 minutes, therefore we can drain the node.

After the drained node upgraded and come back, it still unschedulable, uncordon the node to make it schedulable.

Note that the previouse pods won’t be automatically reschedule back to the nodes.

Cluster Upgrade

The core control plane components’s version can be different, but should follow certain rules:

  • the kube-api is the primary component, none of the other components’s version must not be higher than the kube-api
  • the components can be lower in 1-2 versions
    • kube-api: x
    • Controlloer-manager, kube-scheduler: x, x-1
    • kubelet, kube-proxy: x, x-1, x-2
  • the kubectl can be one version higher than kube-api: x+1, x, x-1

The kubernetes support only up to the recent 3 minor versions. The recommanded approch is to update one minor version at a time.

Update the cluster depend on how you deploy them:

  • cloud provider: few clicks at the UI
  • kubeadm: using upgrade argument (you should upgrade the kubeadm first!)
  • the hard way from scratch: manually upgrade components by yourself

Two major steps:

  1. upgrade master node, the control plane componets goes down, all management function are down, only the applications deploy on worker nodes keeps serving
  2. update worker nodes, with strategies:
    • upgrade all at once with downtimes
    • upgrade one at a time
    • create new nodes and remove the workloads, then finally remove old nodes

When you run command like kubectl get nodes, the VERSION is indicat the version of the kubelet

Backup and Restore

Master / Node DR

  • Cordon & drain
  • Provision replacement master / node

ETCD DR

Option: Backup resources

Saving objects as a copy by query the kube-api

Option: Backup ETCD

Making copies of the ETCD data directory

Or use the etcd command line tool

  1. Make a snap shot

    Remember to specify the certification files for authentication
  2. Stop kube-api
  3. Restore snapshot

    When ETCD restore from a backup, it initialize a new cluster configuration and configures the members of ETCD as new members to a new cluster. This is to prevent a new member from accidentally joining an existing cluster.
    For example, using a snapshot to provision a new etcd-cluster from testing purpose. You don’t want the members in the new test cluster to accidentally join the production cluster.

  4. Configure the etcd.service with new data directory and new cluster token

    During a restore, you must provide a new cluster token and the same initial cluster configuration

  5. Restart ETCD service
  6. Start kube-api

Persistant Volume DR

You can’t relay on kubernetes to for backing up and restore persistant volumes.

If you’re using cloud provider specific persistant volumes like EBS volumes, Azure managed disks or GCE persistent disks, you should use cloud provider snapshot APIs

Read more “Kubernetes Short Notes(3)”

發表留言

Kubernetes Short Notes (2)

Scheduling

Manual Scheduling

  • Bind the pod to node by nodeName property, before that, the pod stays in the Pending state

  • Manutal ways to bind:

    • specify the spec.nodeName, not updatable

    • create the Binding object

Labeling

Use to group and select the objects, for example a ReplicaSet object configs:

  • metadata.labels sets the ReplicaSet itself
  • spec.template.metadata.lables sets the Pod
  • spec.selector.matchLabels defines how ReplicaSet to discover the Pod

Annotations

Use to record other details for intergration purpose e.g. build info, contact details

Restriction

Taint/Toleration

Limit pods without tolerations cannot get scheduled to a tainted node

  • Taint the nodes

  • Set the pods’ tolerance, three behavior are avaliable if not tolerant:

    • NoSchedule

  • PreferNoSchedule: not guaranteed
  • NoExecute: new pods=NoSchedule, existed pods=evicted

Note the value in tolerations keys must use double quotes

Node Selector

Limit the pod to get scheduled to one kind of node only

  • Lable the node
  • Set the nodeSelector

Note there is no OR or NOT conditions, use node affinity instead

Node Affinity

Limit the pod to get scheduled to one or more particular nodes

  • Lable the node
  • Set the nodeAffinity
  • operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
  • 3 types

Combines the Taint/Toleration with NodeSelector or NodeAffinity to cover the scenarios

Resources

Request

  • The scheduling base on the resource requests
  • By default, k8s assumes a pod requires 0.5 cpu and 256Mi memory

Limit

  • By default, k8s limit a pod to 1 cpu and 512Mi memory
  • When a pod try to exceed resources beyond the limit
    • cpu: k8s throttles the cpu won’t kill
    • memory: k8s kill the pod with OOM

Static Pods

Use in creating control plane components (kube admin tools)

Without the intervention from the kube-api server, the kubelet can manage a node independently by monitor config files in the file system, and be able to create, recreate, update and delete the POD only object

  • –pod-manifest-path=/etc/Kubernetes/manifest
  • –config=kubeconfig.yaml (staticPodPath)

While the static pod created, the kube-api only get a readable mirror and not have the ability to update/delete it

Multiple Scheduler

  • copy the kube-scheduler configs from /etc/kubernetes/manifests
  • rename the scheduler --scheduler-name
  • if one master nodes with multiple scheduler:
    • set the --leader-elect=false
  • if multiple masters with multiple scheduler, only one scheduler can active at a time
    • set the --leader-elect=true
    • set the --lock-object-name to differentiate the custom scheduler from default if multiple master
  • specify the scheduler for pod by schedulerName
Read more “Kubernetes Short Notes (2)” 發表留言

Kubernetes Short Notes (1)

Cluster Architecture

Master Node

  • ETCD cluster
  • kube-scheduler
  • kube-controller-manager

These components communicate via kube-api server

Worker Node

  • container runtime engine, e.g. Docker, Rocket, ContainerD
  • kubelet: agent that runs and listen for instructions from kube-api
  • containers

The services deploy within worker nodes communicate with each other via kube-proxy

Objectives

ETCD

  • a distributed reliable key-value store
  • client commuications on port 2379
  • server to server on port 2380

kube-api

  • primary management component

  • setup:

    1. using kube-admin tools

      • deploy kube-api as a pod in kube-system namespace

      • the manifests is at /etc/kubernetes/manifests/kube-apiserver.yaml

      • the options is at /etc/systemd/system/kube-apiserver.service

      • search for kube-apiserver process on master node

  • example: apply deployment using kubectl

    1. authenticates user

  • validate the HTTP requests
  • the kube-scheduler monitored the changes from the kube-api, then:
    • retrieve the node information from kube-api

  • schedule the pod to some node through kube-api to kubelet

  • update the pod info to ETCD
  • kube-controller-manager

    • continuously monitors the state of components
    • the controllers packages into a single process called Kube-Controller-Manager, which includes:
      1. deployment-controller, cronjob, service-account-controller …
      2. namespace-controller, job-contorller, node-controller …
      3. endpoint-controller, replicaset, replication-controller(replica set) …
    • remediate situation

    kube-scheduler

    • decide which pod goes to which node
      1. filter nodes
      2. rank nodes

    kubelet

    • follow the instruction from kube-scheduler to controll the container runtime engine (e.g. docker) that run or remove a container
    • using kube-admin tools to deploy cluster, the kubelet are not installed by default in worker nodes, need intstall manually

    kube-proxy

    • runs on each nodes in the cluster
    • create iptables rules on each nodes to forward traffic heading to the IP of the services to the IP of the actual pods
    • kube-admin tool deploy kube-proxy as daemonset in each nodes

    pod

    • the container are encapsulated into a pod
    • is a single instance of an application, the smallest object in k8s
    • containers in same pod shares storages and network namespaces, created and removed in the same time
    • multi-container pod is rare use case

    ReplicationController

    • apiVersion support in v1
    • the process to monitor the pods
    • maintain the HA and specified number of pods that running on all nodes
    • only care about the pod which RestartPolicy is set to Always
    • scalable and replacable application should be managed by the controller
    • use cases: rolling updates, multiple release tracks (multiple replication controller replica the same pod but using different labels)

    ReplicaSets

    • next generation of ReplicationController
    • api version support in apps/v1
    • enhance the filtering in .spec.selector (the major difference)
    • be aware of the non-template pod that has same lables
    • using Deployment as a replacement is recommended, it own and manage its ReplicaSets

    Deployment

    • provide replication vis replicaset and other:
      • rolling update
      • rollout
      • pause and resume

    Namespace

    • namespaces created at cluster creation

      1. kube-system

    • kube-public
    • default
    • each namespace can be assigned quota of resources

    • a DNS entry with SERVICE_NAME.NAMESPACE.svc.cluster.local format is automatically created when at service creation

      1. the cluster.local is the default domain name of the cluster

    • permanently config the namespace

    Read more “Kubernetes Short Notes (1)” 發表留言

    Redis

    Redis

    compare to memcached

    • support persistant volume
      • RDB
      • AOF
    • support multiple data types
    • pub/sub

    commands

    • redis-cli: command line interface
    • redis-sentinel: cluster managing tool
    • redis-server: run server
    • redis-benchmark: stress testing
    • redis-check-aof: check AOF
    • redis-check-dump: check RDB

    configuration

    Use redis.conf. Docker official redis image not contain this file. Mount it yourself or through redis-server arguments.

    types

    • String: get, set, mget, mset
    • Integer: incr, decr, setbit
    • List: lpush, lrange, lpop
    • Hash Map: hset, hget, hmset, hmget
    • Set: sadd, smember, sdiff, sinter, sunion

    use docker

    Before start

    To connect a container, you need to know the name and the port, in the associated networks to be able to discover the service.

    There is no DNS resolution in docker deault bridge network. In default network, you need to specify --link to connect the containers. The --link is a legacy feature.

    Therefore, create a user-defined network is recommanded, it provide automatic DNS resolution.

    Create a bridge newrok

    Run a redis instance in user-defined network

    Run a redis-cli connect to the redis instance

    Transaction

    all commands are executed as a single isolated operation, serialized and executed sequentially
    atomic: all failed or all succeed

    • MULTI: open a transaction and always return OK
    • EXEC: execute commands in transaction
    • DISCARD: flush commands and exit transaction
    • WATCH: check and set, if watched key changes, not execute

    Errors

    • before EXEC: e.g. syntax error
    • after EXEC: e.g. value error

    The pipeline discarding the transaction automatically if there was an error during the command queueing

    … To be continued

    發表留言

    Memcached

    Memcache

    Store and retrieve data in memory(not persistent) base on specific hash function.

    concepts

    • Slab: allocate as many pages as the ones available

    • Page: a memory area of default 1MB which contains as many chunks

    • Chunk: minimum allocated space for a single item

    • LRU: least recently used list

    ref: Journey to the centre of memcached

    we could say that we would run out of memory when all the available pages are allocated to slabs

    memcached is designed to evict old/unused items in order to store new ones

    every item operation (get, set, update or remove) requires the item in question to be locked

    memcached only tries to remove the first 5 items of the LRU — after that it simply gives up and answers with OOM (out of memory)

    Read more “Memcached” 發表留言