tags: k8s
Networking
Linux Networking Basics
Network Switch
A switch is a device that connects other devices together in a computer network; it can only enable communication within a single network
Host A (192.168.1.10) [eth0] ↔ Switch (192.168.1.0) ↔ [eth0] Host B (192.168.1.11)
```
# For Network A
$ ip link
$ ip addr add 192.168.1.10/24 dev eth0  # set a IP addr for interface eth0

# For Network B
$ ip link
$ ip addr add 192.168.1.11/24 dev eth0

# Test
$ ping 192.168.1.11
```
Router
A router is a device/service that provides the function of routing IP packets between networks
Switch (192.168.1.0) ↔ [192.168.1.1] Router [192.168.2.1] ↔ Switch (192.168.2.0)
Route/Gateway
A gateway (in network terms) is a router that acts as a network's entry and exit point, providing connectivity to other networks
```
# For Network A
$ ip route add 192.168.2.0/24 via 192.168.1.1

# For Network B
$ ip route add 192.168.1.0/24 via 192.168.2.1
```
Default Gateway
If none of the forwarding rules in the routing table is appropriate for a given destination address, the default gateway is chosen as the router of last resort
```
$ ip route show default
```
Forwarding packets between interfaces
By default, Linux does not forward packets from one interface to another, for security reasons
Explicitly allow it:
```
$ echo 1 > /proc/sys/net/ipv4/ip_forward
```
To persist the setting, add it to /etc/sysctl.conf:
```
net.ipv4.ip_forward = 1
```
DNS
Translate host names to IP addresses by configuring /etc/hosts
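For example, a couple of entries mapping names to addresses (the names here are just illustrative):

```
$ cat /etc/hosts
192.168.1.10    web
192.168.1.11    db
```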
When an environment has too many entries and IP addresses are not persistent, we need a DNS server
```
$ cat /etc/resolv.conf
nameserver 192.168.1.100
```
The host will look up an entry in /etc/hosts first, then query the DNS server. This order can be changed in the configuration file /etc/nsswitch.conf
```
$ cat /etc/nsswitch.conf
passwd:         files
group:          files
shadow:         files
gshadow:        files
hosts:          files dns
networks:       files
protocols:      db files
services:       db files
ethers:         db files
rpc:            db files
netgroup:       nis
```
You can configure the DNS server to forward unknown host names to a public name server on the Internet, for example to reach www.google.com:
private DNS → Root DNS → .com DNS → google DNS → cache the result
When looking for a host in the same domain, we want to simply use the host name rather than the full name, such as web instead of web.mycompany.com. To do that, specify the domain name to append in /etc/resolv.conf
```
$ cat /etc/resolv.conf
search mycompany.com
```
Records are stored in DNS with specific types:
- A: ipv4
- AAAA: ipv6
- CNAME: name to name mapping
You can use tools like nslookup or dig to debug. Note that these tools only query the DNS server, not /etc/hosts
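For example (the names queried here are just illustrative, and 192.168.1.100 is the nameserver configured above):

```
# Query the nameserver from /etc/resolv.conf
$ nslookup www.google.com
$ dig www.google.com

# Query a specific DNS server directly
$ dig @192.168.1.100 web
```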
There are plenty of DNS server solutions, such as CoreDNS. Besides reading entries from files, CoreDNS supports other ways of configuring DNS entries through plugins, such as the kubernetes plugin
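For illustration, a minimal Corefile sketch that serves entries from /etc/hosts and forwards everything else to a public resolver (the file path and forwarder address are assumptions):

```
$ cat /etc/coredns/Corefile
.:53 {
    hosts /etc/hosts {
        fallthrough
    }
    forward . 8.8.8.8
    log
}
```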
Network Namespace
A namespace is a way of scoping a particular set of identifiers
Linux provides namespaces for networking and processes. If a process is running within a process namespace, it can only see and communicate with other processes in the same namespace
Linux starts up with a default network namespace
Each network namespace has its own routing table and has its own set of iptables
```
# Create namespace
ip netns add red

# List namespaces
ip netns list

# List interfaces
ip link

# List interfaces in a namespace
ip netns exec red ip link
# or
ip -n red link
```
Connect namespaces together using a virtual Ethernet pair (or virtual cable, pipe)
```
# Create veth pair
$ ip link add veth-red type veth peer name veth-blue

# Attach each interface to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue

# Assign IP to each namespace
$ ip -n red addr add 192.168.15.1 dev veth-red
$ ip -n blue addr add 192.168.15.2 dev veth-blue

# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up

# List ARP table to see neighbors
$ ip netns exec red arp

# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
```
When more namespaces need to be connected, use a virtual switch to create a virtual network. There are a few solutions:
- Linux Bridge
- Open vSwitch
```
# Create a virtual switch interface
$ ip link add v-net-0 type bridge

# Bring the interface up
$ ip link set dev v-net-0 up

# Create cables for each namespace to connect to the bridge
$ ip link add veth-red type veth peer name veth-red-br
$ ip link add veth-blue type veth peer name veth-blue-br

# Attach one end to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue

# Attach the other end to the bridge
$ ip link set veth-red-br master v-net-0
$ ip link set veth-blue-br master v-net-0

# Assign IP to each namespace
$ ip -n red addr add 192.168.15.1 dev veth-red
$ ip -n blue addr add 192.168.15.2 dev veth-blue

# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up

# Bring up the bridge ends of the cables as well
$ ip link set veth-red-br up
$ ip link set veth-blue-br up

# Assign IP address to the bridge (since it’s just another interface on the host)
$ ip addr add 192.168.15.3/24 dev v-net-0

# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
```
When the private virtual network needs to reach the outside network, it needs a gateway; the host is that gateway
```
$ ip netns exec red ip route add 192.168.1.0/24 via 192.168.15.3
```
For the destination network to respond, enable NAT on the host acting as the gateway.
Add a new rule to the NAT table in the POSTROUTING chain to masquerade, i.e. replace the source address on all packets coming from the source network 192.168.15.0 with the host's own IP address.
Thus anyone receiving these packets outside the network will think that they are coming from the host and not from within the namespaces
```
$ iptables -t nat -A POSTROUTING -s 192.168.15.0/24 -j MASQUERADE
```
Add a default route via the gateway to reach the outside world
```
$ ip netns exec red ip route add default via 192.168.15.3
```
For the outside world to reach a namespace in the private network, add a port-forwarding rule with iptables saying that any traffic coming to port 80 on the host is to be forwarded to port 80 on the IP assigned to the namespace
```
$ iptables \
    -t nat \
    -A PREROUTING \
    -p tcp \
    --dport 80 \
    -j DNAT \
    --to-destination 192.168.15.1:80
```
Docker Networking
- none: disable the networking stack on a container
- host: remove network isolation between the container and the Docker host, and use the host’s networking directly
- bridge: connect containers to the same bridge network to communicate
When Docker is installed on the host it creates an internal private network called bridge by default. On the host the network is created by the name docker0.
```
$ ip link | grep docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default

$ ip addr | grep docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
```
As mentioned above, docker0 is the virtual switch for the virtual Docker network; it is created by the same approach:
```
$ ip link add docker0 type bridge
```
When we run a container, Docker creates a network namespace for it and connects it to the bridge
```
# Let's run a container!
$ docker run -d nginx:alpine

$ docker inspect <container ID>
"NetworkSettings": {
    "Bridge": "",
    "SandboxID": "e9028bfa6326bcb02264a5bfc02bea7ae714d6edd4a5b445db3e2192c82e4df9",
    "SandboxKey": "/var/run/docker/netns/e9028bfa6326",

# List the created namespace
$ sudo ip netns
e9028bfa6326 (id: 0)

# The host end of the veth cable
$ ip link
43: veth1de5d15@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether 8a:cd:37:b9:3a:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 0

# The container end of the veth cable
$ sudo ip -n e9028bfa6326 link
44: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

# Show the IP assigned to the container
$ sudo ip -n e9028bfa6326 addr
44: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
```
Forward traffic from the outside world to the container through the host:
```
$ docker run -d -p 8080:80 nginx:alpine
```
To do that, Docker adds a rule to the DOCKER chain with the destination set to the container's IP
```
$ iptables \
    -t nat \
    -A DOCKER \
    -p tcp \
    --dport 8080 \
    -j DNAT \
    --to-destination <container IP>:<container port>
```
List the rules that Docker creates in iptables
```
$ iptables -nvL -t nat
Chain DOCKER (2 references)
 pkts bytes target   prot opt in        out  source     destination
    0     0 RETURN   all  --  docker0   *    0.0.0.0/0  0.0.0.0/0
    0     0 DNAT     tcp  --  !docker0  *    0.0.0.0/0  0.0.0.0/0    tcp dpt:8080 to:172.17.0.2:80
```
List the listening ports on the host
```
$ sudo netstat -nplt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address    State    PID/Program name
tcp6       0      0 :::8080          :::*               LISTEN   9421/docker-proxy
```
Networking in Kubernetes
Container Network Interface (CNI)
… many container runtimes and orchestrators will seek to solve the same problem of making the network layer pluggable. To avoid duplication, we think it is prudent to define a common interface between the network plugins and container execution …
For container runtime:
- Container Runtime must create network namespace
- Identify network the container must attach to
- Container Runtime to invoke Network Plugin (e.g. bridge) when container is added/deleted
- JSON format configuration
For plugin:
- Must support command line arguments ADD/DEL/CHECK
- Must support parameters container ID, network ns …
- Must manage IP Address assignment to PODs
- Must return results in a specific format
CNI supports these plugins:
- BRIDGE
- VLAN
- IPVLAN
- MACVLAN
- WINDOWS
- DHCP
- host-local
Other plugins are supported by third-party organizations; all of these implement the CNI standard:
- Weaveworks (Weave Net)
- Flannel
- Cilium
- VMware NSX
- Calico
- Infoblox
Docker does not implement CNI. Docker has its own standard known as the Container Network Model (CNM), which aims at solving container networking challenges similar to CNI but with some differences. Due to these differences, CNI plugins don't natively integrate with Docker
You'll need to work around it yourself if you want to run Docker containers with network plugins that follow CNI, for example by creating a Docker container without any network configuration and then manually invoking the bridge plugin yourself. That is pretty much how Kubernetes does it.
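A rough sketch of that workaround (the container name is made up, and the plugin and config paths are the defaults shown later in these notes):

```
# Start a container with no networking at all
$ docker run -d --name web --network=none nginx:alpine

# Find the container's network namespace via its PID
$ pid=$(docker inspect -f '{{.State.Pid}}' web)

# Invoke the CNI bridge plugin manually: the network config is passed on stdin,
# everything else through CNI_* environment variables
$ sudo CNI_COMMAND=ADD \
       CNI_CONTAINERID=web \
       CNI_NETNS=/proc/$pid/ns/net \
       CNI_IFNAME=eth0 \
       CNI_PATH=/opt/cni/bin \
       /opt/cni/bin/bridge < /etc/cni/net.d/10-bridge.conf
```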
Cluster Node Networking
Each node in the cluster must have at least one interface with an address configured. The hosts must each have a unique hostname and a unique MAC address.
Some ports used by the control plane components also need to be opened:
- 2379 on master nodes for etcd clients
- 2380 on multiple master nodes for etcd peer-to-peer communication
- 6443 on master node for kube-api
- 10250 on master/worker nodes for kubelet
- 10251 on master node for kube-scheduler
- 10252 on master node for kube-controller-manager
- 30000-32767 on worker nodes for NodePort services
Ref: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#check-required-ports
Pod Layer Networking
Kubernetes does not come with a built-in solution for POD networking; it expects you to implement a networking solution that fits clearly laid out requirements, such as:
- every POD should have its own unique IP
- every POD should be able to reach every other POD on the same node
- every POD should be able to reach every other POD across the nodes in the cluster
A network configuration script should meet the CNI standard; it is then executed by kubelet when starting a container
kubelet
↓
--cni-conf-dir=/etc/cni/net.d
↓
--cni-bin-dir=/opt/cni/bin
↓
./net-script.sh add
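As a rough sketch (this is not a real plugin; the bridge name, address, and argument convention are assumptions), the add step of such a script might look like:

```
#!/bin/bash
# net-script.sh — illustrative only
# Usage: ./net-script.sh add <container> <namespace>

if [ "$1" == "add" ]; then
    container=$2
    namespace=$3

    # Create a veth pair: one end for the bridge, one for the container
    ip link add veth-$container type veth peer name veth-$container-br

    # Attach the bridge end to the node's bridge and the other end to the container's namespace
    ip link set veth-$container-br master cni0
    ip link set veth-$container netns $namespace

    # Assign an IP (a real plugin would get this from the IPAM section) and bring the links up
    ip -n $namespace addr add 10.244.1.2/24 dev veth-$container
    ip -n $namespace link set veth-$container up
    ip link set veth-$container-br up
fi
```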
CNI in Kubernetes
The CNI plugin is invoked by the component within Kubernetes that is responsible for creating containers: kubelet
Find the CNI binary and configuration directories
```
$ ps -aux | grep kubelet
root  2235  2.9  4.6 746620 94640 ?  Ssl  16:11  0:03 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --netw
```
List the CNI plugins
```
$ ls /opt/cni/bin
bridge   host-local  macvlan  sample   weave-ipam
dhcp     ipvlan      portmap  tuning   weave-net
flannel  loopback    ptp
```
View the configuration file
```
$ ls /etc/cni/net.d
10-bridge.conf

$ cat /etc/cni/net.d/10-bridge.conf
{
    "cniVersion": "0.2.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16",
        "routes": [
            { "dst": "0.0.0.0/0" }
        ]
    }
}
```
The isGateway field defines whether the bridge network interface should get an IP address assigned so it can act as a gateway.
The ipMasq field defines whether a NAT rule should be added for IP masquerading.
The ipam section defines the IPAM configuration, where you specify the subnet or the range of IP addresses that will be assigned to pods and any necessary routes. The type host-local indicates that the IP addresses are managed locally on this host, unlike a DHCP server maintaining them remotely. The type can also be set to dhcp to use an external DHCP server.
Weave Net
When the weave CNI plugin is deployed on a cluster, it deploys an agent or service on each node. They communicate with each other to exchange information regarding the nodes and networks and PODs within them.
Each agent or peer stores a topology of the entire setup, that way they know the pods and their IPs on the other nodes.
Weave creates its own bridge on the nodes and names it weave. Then assigns IP address to each network.
What path a packet takes to reach destination depends on the route configured on the container. Weave makes sure that PODs get the correct route configured to reach the agent. And the agent then takes care of other PODs.
When a packet is sent from one pod to another on another node, weave intercepts the packet and identifies that it’s on a separate network. Weave then encapsulates this packet into a new one with new source and destination and sends it across the network. On the other side, the other weave agent retrieves the packet, decapsulates and routes the packet to the right POD.
The easier way to deploy Weave is as a DaemonSet in the cluster; the weave POD then runs on all the nodes:
```
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
role.rbac.authorization.k8s.io/weave-net created
rolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.extensions/weave-net created
```
Inspect the configuration file
```
$ cat /etc/cni/net.d/10-weave.conf
{
    "name": "weave",
    "type": "weave-net",
    "hairpinMode": true
}
```
Inspect the weave bridge interface
```
$ ip link | grep weave
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
13: vethwepldab32dc@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
15: vethweplcf123c9@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
```
Inspect the IP address range the weave network uses
```
$ ip addr show weave
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP group default qlen 1000
    link/ether ce:13:79:1a:ed:6c brd ff:ff:ff:ff:ff:ff
    inet 10.44.0.0/12 brd 10.47.255.255 scope global weave
       valid_lft forever preferred_lft forever
    inet6 fe80::cc13:79ff:fe1a:ed6c/64 scope link
       valid_lft forever
```
IP Address Management (IPAM)
- DHCP
- static
- host-local: stores the state locally on the host filesystem, therefore ensuring uniqueness of IP addresses on a single host
- others, such as Weave's own IPAM
By default, Weave allocates the IP range 10.32.0.0/12 for the entire network. That gives the network IPs from 10.32.0.1 to 10.47.255.254 to use for PODs, and the Weave peers split the range equally, assigning one portion to each node
Service Networking
A Service is a cluster-wide concept. There is no server or process actually listening on the IP of the service, and there are no processes, namespaces, or interfaces for a service; it is just a virtual object
When we create a service object in Kubernetes, it is assigned an IP address from a pre-defined range. The kube-proxy component running on each node creates forwarding rules for the service IP, forwarding traffic from the service IP to the POD IP
The kube-proxy supports different ways to create these rules (proxy-mode)
- userspace
- ipvs
- iptables (default)
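In the default iptables mode, you can inspect the rules kube-proxy programs: traffic to a service's cluster IP is matched in the KUBE-SERVICES chain and DNATed to one of the POD IPs via a per-service KUBE-SVC-* chain (the service name and chain hash below are placeholders):

```
# Find the jump rule for a particular service, then follow its KUBE-SVC-* chain
$ iptables -t nat -L KUBE-SERVICES -n | grep <service name>
$ iptables -t nat -L KUBE-SVC-<hash> -n
```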
Inspect the service IP range
```
$ ps -aux | grep kube-api
root  1943  3.8 15.4 399512 316232 ?  Ssl  12:08  1:02 kube-apiserver --service-cluster-ip-range=10.96.0.0/12 ...
```
Inspect the proxy mode
```
$ kubectl logs kube-proxy-ffncj -n kube-system
W0105 12:09:14.662118       1 server_others.go:287] Flag proxy-mode="" unknown, assuming iptables proxy
I0105 12:09:14.663302       1 server_others.go:140] Using iptables Proxier.
W0105 12:09:14.698727       1 proxier.go:311] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0105 12:09:14.698840       1 server_others.go:174] Tearing down inactive rules.
I0105 12:09:14.808097       1 server.go:448] Version: v1.11.3
I0105 12:09:14.832692       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0105 12:09:14.833060       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0105 12:09:14.833502       1 conntrack.go:83] Setting conntrack hashsize to 32768
I0105 12:09:14.833715       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0105 12:09:14.833768       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0105 12:09:14.833966       1 config.go:102] Starting endpoints config controller
I0105 12:09:14.833995       1 controller_utils.go:1025] Waiting for caches to sync for endpoints config controller
I0105 12:09:14.834299       1 config.go:202] Starting service config controller
I0105 12:09:14.834328       1 controller_utils.go:1025] Waiting for caches to sync for service config controller
I0105 12:09:14.934171       1 controller_utils.go:1032] Caches are synced for endpoints config controller
I0105 12:09:14.934523       1 controller_utils.go:103
```
Cluster DNS Resolution
Kubernetes deploys a built-in DNS server by default (you need to do it yourself if you set up Kubernetes manually). When a service is created, the Kubernetes DNS creates a record for the service, mapping the service name to its IP address
For each namespace the DNS server creates a subdomain. All the services are grouped together into another subdomain called svc
All the services and PODs are grouped together into a root domain for the cluster, which is set to cluster.local by default
For example, curl a web-service in the namespace apps:
```
curl https://web-service
curl https://web-service.apps
curl https://web-service.apps.svc

# FQDN
curl https://web-service.apps.svc.cluster.local
```
Records for PODs are not created by default, but we can enable that explicitly. Once enabled, records are created for PODs as well. The record does not use the POD name; it generates a name by replacing the dots in the IP address with dashes
| Hostname    | Namespace | Type | Root          | IP Address    |
|-------------|-----------|------|---------------|---------------|
| web-service | apps      | svc  | cluster.local | 10.107.37.188 |
| 10-244-2-5  | apps      | pod  | cluster.local | 10.244.2.5    |
This is roughly how Kubernetes does it, except that it does not create entries mapping POD names to IP addresses; it only does that for services
Prior to version 1.12, the DNS server implemented by Kubernetes was known as kube-dns. From version 1.12 onward, the recommended DNS server is CoreDNS
CoreDNS is deployed as PODs (a Deployment with 2 replicas) in the cluster, each running the CoreDNS executable.
```
$ kubectl exec coredns-78fcdf6894-clgxn -n kube-system -- ps
PID   USER     TIME   COMMAND
    1 root       0:03 /coredns -conf /etc/coredns/Corefile
   21 root

$ kubectl exec coredns-78fcdf6894-clgxn -n kube-system -- cat /etc/coredns/Corefile
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    reload
}
```
This config file is passed to the POD as a ConfigMap object
```
$ kubectl get configmap -n kube-system
```
When we deploy the CoreDNS solution, it also creates a service to make it available to other components within the cluster. The service is named kube-dns by default. The kubelet is responsible for configuring the PODs' nameserver to point at this DNS server IP
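You can verify this from inside any POD; the nameserver should point at the kube-dns service IP (the POD name is a placeholder, and 10.96.0.10 is just a common default):

```
$ kubectl exec <pod name> -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
```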
If you manually look up the service using nslookup or the host command, it returns the FQDN of the service. The short name resolves because the resolv.conf file also has a search entry, which is set to default.svc.cluster.local, svc.cluster.local and cluster.local. Notice that there are only search entries for services; for a POD you need to specify its FQDN
```
$ host 10-244-2-5.default.pod.cluster.local
```
Ingress
Service node ports can only be allocated from the high-numbered range (30000-32767)
If using a public cloud, we can use the service type LoadBalancer to create a proxy for the service. Kubernetes then sends an additional request to the cloud provider to provision a network load balancer, configured to route traffic to the service ports
When you run multiple applications, you need yet another proxy to redirect traffic to the multiple load balancers, which increases complexity since you have to configure firewall rules and SSL for each application
That's where Ingress comes in. Ingress is a layer 7 load balancer built into the cluster. It lets users access applications through a single externally accessible URL that can be configured to route to different services within the cluster based on the URL path, and it can implement SSL as well
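As a sketch, an Ingress resource with path-based routing might look like this (the host, service names, and ports are made up, and the apiVersion differs on older clusters):

```
$ kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
  - host: my-app.example.com
    http:
      paths:
      - path: /web
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
EOF
```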
There are several ingress controller solutions available:
- GCE
- Nginx
- HAProxy
- Traefik
- Istio
GCE and Nginx are currently supported and maintained by the Kubernetes project. An Ingress Controller is not just a web server; it has additional intelligence to monitor the Kubernetes cluster for new ingress resource definitions
To manually configure an nginx ingress controller, you will need to create these objects:
- Deployment: expose container port 80 and 443
- Service: expose node port 80 and 443
- ConfigMap: store nginx configuration files
- ServiceAccount, Role, ClusterRole, RoleBinding: for permissions to re-configure at resource changes
You might also want to deploy an additional service used as the default backend: traffic that does not match any rule is directed to this service by default
See the full configuration in the nginx-ingress Helm chart
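For instance, something along these lines installs the community-maintained chart (the repo URL, chart, and release names are assumptions; check the chart's documentation for current values):

```
$ helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
$ helm install ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx --create-namespace
```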
Storage
Persistent Volume
Besides storing a volume on the host, Kubernetes supports several types of storage solutions:
- NFS
- GlusterFS
- Flocker
- Ceph
- ScaleIO
- AWS EBS
- Azure Disk
- Google Persistent Disk
Persistent Volume Claim
Administrators create PVs, and users create PVCs to use them. During the binding process, Kubernetes tries to find a PV that has sufficient capacity as requested by the claim and that satisfies any other requested properties such as access modes, volume modes, storage class and selector
Note that a smaller claim may get bound to a larger volume if all the other criteria match and there are no better options
There is a one-to-one relationship between PV and PVC; no other claims can utilize the remaining capacity in the volume
Configure the field persistentVolumeReclaimPolicy to define what action to perform on the PV after its PVC is deleted:
- Retain (default)
- Delete
- Recycle
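To tie these pieces together, here is a minimal sketch of a PV and a PVC that would bind to it (the hostPath backend, names, and sizes are only for illustration):

```
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-example
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/pv-example
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
EOF
```

The 500Mi claim can bind to the 1Gi volume here since no better match exists, and Retain keeps the volume (and its data) around after the claim is deleted.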