tags: k8s
Networking
Linux Networking Basics
Network Switch
A switch is a device that connects other devices together in a computer network; it can only enable communication within a single network
Host A (192.168.1.10) [eth0] ↔ Switch (192.168.1.0) ↔ [eth0] Host B (192.168.1.11)
```
# For Network A
$ ip link
$ ip addr add 192.168.1.10/24 dev eth0  # set a IP addr for interface eth0

# For Network B
$ ip link
$ ip addr add 192.168.1.11/24 dev eth0

# Test
$ ping 192.168.1.11
```
Router
A router is a device/service that provides the function of routing IP packets between networks
Switch (192.168.1.0) ↔ [192.168.1.1] Router [192.168.2.1] ↔ Switch (192.168.2.0)
Route/Gateway
A gateway (in network terms) is a router that acts as a network's entry and exit point, providing connectivity to other networks
```
# For Network A
$ ip route add 192.168.2.0/24 via 192.168.1.1

# For Network B
$ ip route add 192.168.1.0/24 via 192.168.2.1
```
Default Gateway
If none of the forwarding rules in the routing table is appropriate for a given destination address, the default gateway is chosen as the router of last resort
```
$ ip route show default
```
Forwarding packets between interfaces
By default, Linux does not forward packets from one interface to another, for security reasons
Explicitly allow it:
```
$ echo 1 > /proc/sys/net/ipv4/ip_forward
```
To persist the setting, add it to /etc/sysctl.conf:
```
net.ipv4.ip_forward = 1
```
DNS
Translate host names to IP addresses by configuring /etc/hosts
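For example, a couple of entries mapping names to addresses (the names here are just illustrative):

```
$ cat /etc/hosts
192.168.1.10    web
192.168.1.11    db
```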
When an environment has too many entries and IP addresses are not persistent, we need a DNS server
```
$ cat /etc/resolv.conf
nameserver 192.168.1.100
```
The host will look up an entry in /etc/hosts first, then query the DNS server. This order can be changed in the configuration file /etc/nsswitch.conf
```
$ cat /etc/nsswitch.conf
passwd:         files
group:          files
shadow:         files
gshadow:        files
hosts:          files dns
networks:       files
protocols:      db files
services:       db files
ethers:         db files
rpc:            db files
netgroup:       nis
```
You can configure the DNS server to forward unknown host names to a public name server on the Internet, for example to reach www.google.com:
private DNS → Root DNS → .com DNS → google DNS → cache the result
When looking for a host in the same domain, we want to simply use the host name rather than the full name, such as web instead of web.mycompany.com. To do that, specify the domain name to append in /etc/resolv.conf
```
$ cat /etc/resolv.conf
search mycompany.com
```
Records are stored in DNS with specific types:
- A: ipv4
- AAAA: ipv6
- CNAME: name to name mapping
You can use tools like nslookup or dig to debug. Note that these tools only query the DNS server, not /etc/hosts
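For example (the names queried here are just illustrative, and 192.168.1.100 is the nameserver configured above):

```
# Query the nameserver from /etc/resolv.conf
$ nslookup www.google.com
$ dig www.google.com

# Query a specific DNS server directly
$ dig @192.168.1.100 web
```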
There are plenty of DNS server solutions, such as CoreDNS. Besides reading entries from files, CoreDNS supports other ways of configuring DNS entries through plugins, such as the kubernetes plugin
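For illustration, a minimal Corefile sketch that serves entries from /etc/hosts and forwards everything else to a public resolver (the file path and forwarder address are assumptions):

```
$ cat /etc/coredns/Corefile
.:53 {
    hosts /etc/hosts {
        fallthrough
    }
    forward . 8.8.8.8
    log
}
```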
Network Namespace
A namespace is a way of scoping a particular set of identifiers
Linux provides namespaces for networking and processes. If a process is running within a process namespace, it can only see and communicate with other processes in the same namespace
Linux starts up with a default network namespace
Each network namespace has its own routing table and has its own set of iptables
```
# Create namespace
ip netns add red

# List namespaces
ip netns list

# List interfaces
ip link

# List interfaces in a namespace
ip netns exec red ip link
# or
ip -n red link
```
Connect namespaces together using a virtual Ethernet pair (or virtual cable, pipe)
```
# Create veth pair
$ ip link add veth-red type veth peer name veth-blue

# Attach each interface to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue

# Assign IP to each namespace
$ ip -n red addr add 192.168.15.1 dev veth-red
$ ip -n blue addr add 192.168.15.2 dev veth-blue

# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up

# List ARP table to see neighbors
$ ip netns exec red arp

# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
```
When more namespaces need to be connected, use a virtual switch to create a virtual network. There are a few solutions:
- Linux Bridge
- Open vSwitch
```
# Create a virtual switch interface
$ ip link add v-net-0 type bridge

# Bring the interface up
$ ip link set dev v-net-0 up

# Create cables for each namespace to connect to the bridge
$ ip link add veth-red type veth peer name veth-red-br
$ ip link add veth-blue type veth peer name veth-blue-br

# Attach one end to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue

# Attach the other end to the bridge
$ ip link set veth-red-br master v-net-0
$ ip link set veth-blue-br master v-net-0

# Assign IP to each namespace
$ ip -n red addr add 192.168.15.1 dev veth-red
$ ip -n blue addr add 192.168.15.2 dev veth-blue

# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up

# Bring up the bridge ends of the cables as well
$ ip link set veth-red-br up
$ ip link set veth-blue-br up

# Assign IP address to the bridge (since it’s just another interface on the host)
$ ip addr add 192.168.15.3/24 dev v-net-0

# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
```
When the private virtual network needs to reach the outside network, it needs a gateway; the host is that gateway
```
$ ip netns exec red ip route add 192.168.1.0/24 via 192.168.15.3
```
For the destination network to respond, enable NAT on the host acting as the gateway.
Add a new rule to the NAT table in the POSTROUTING chain to masquerade, i.e. replace the source address on all packets coming from the source network 192.168.15.0 with the host's own IP address.
Thus anyone receiving these packets outside the network will think that they are coming from the host and not from within the namespaces
```
$ iptables -t nat -A POSTROUTING -s 192.168.15.0/24 -j MASQUERADE
```
Add a default route via the gateway to reach the outside world
```
$ ip netns exec red ip route add default via 192.168.15.3
```
For the outside world to reach a namespace in the private network, add a port-forwarding rule with iptables saying that any traffic coming to port 80 on the host is to be forwarded to port 80 on the IP assigned to the namespace
```
$ iptables \
    -t nat \
    -A PREROUTING \
    -p tcp \
    --dport 80 \
    -j DNAT \
    --to-destination 192.168.15.1:80
```
Docker Networking
- none: disable the networking stack on a container
- host: remove network isolation between the container and the Docker host, and use the host’s networking directly
- bridge: connect containers to the same bridge network to communicate
When Docker is installed on the host it creates an internal private network called bridge by default. On the host the network is created by the name docker0.
```
$ ip link | grep docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default

$ ip addr | grep docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
```
As mentioned above, docker0 is the virtual switch for the virtual Docker network; it is created by the same approach:
```
$ ip link add docker0 type bridge
```
When we run a container, Docker creates a network namespace for it and connects it to the bridge
```
# Let's run a container!
$ docker run -d nginx:alpine

$ docker inspect <container ID>
"NetworkSettings": {
    "Bridge": "",
    "SandboxID": "e9028bfa6326bcb02264a5bfc02bea7ae714d6edd4a5b445db3e2192c82e4df9",
    "SandboxKey": "/var/run/docker/netns/e9028bfa6326",

# List the created namespace
$ sudo ip netns
e9028bfa6326 (id: 0)

# The host end of the veth cable
$ ip link
43: veth1de5d15@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether 8a:cd:37:b9:3a:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 0

# The container end of the veth cable
$ sudo ip -n e9028bfa6326 link
44: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

# Show the IP assigned to the container
$ sudo ip -n e9028bfa6326 addr
44: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
```
Forward traffic from the outside world to the container through the host:
```
$ docker run -d -p 8080:80 nginx:alpine
```
To do that, Docker adds a rule to the DOCKER chain with the destination set to the container's IP
```
$ iptables \
    -t nat \
    -A DOCKER \
    -p tcp \
    --dport 8080 \
    -j DNAT \
    --to-destination <container IP>:<container port>
```
List the rules that Docker creates in iptables
```
$ iptables -nvL -t nat
Chain DOCKER (2 references)
 pkts bytes target   prot opt in        out  source     destination
    0     0 RETURN   all  --  docker0   *    0.0.0.0/0  0.0.0.0/0
    0     0 DNAT     tcp  --  !docker0  *    0.0.0.0/0  0.0.0.0/0    tcp dpt:8080 to:172.17.0.2:80
```
List the listening ports on the host
```
$ sudo netstat -nplt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address    State    PID/Program name
tcp6       0      0 :::8080          :::*               LISTEN   9421/docker-proxy
```
Networking in Kubernetes
Container Network Interface (CNI)
… many container runtimes and orchestrators will seek to solve the same problem of making the network layer pluggable. To avoid duplication, we think it is prudent to define a common interface between the network plugins and container execution …
For container runtime:
- Container Runtime must create network namespace
- Identify network the container must attach to
- Container Runtime to invoke Network Plugin (e.g. bridge) when container is added/deleted
- JSON format configuration
For plugin:
- Must support command line arguments ADD/DEL/CHECK
- Must support parameters container ID, network ns …
- Must manage IP Address assignment to PODs
- Must return results in a specific format
CNI supports these plugins:
- BRIDGE
- VLAN
- IPVLAN
- MACVLAN
- WINDOWS
- DHCP
- host-local
Other plugins are supported by third-party organizations; all of these implement the CNI standard:
- Weaveworks (Weave Net)
- Flannel
- Cilium
- VMware NSX
- Calico
- Infoblox
Docker does not implement CNI. Docker has its own standard known as the Container Network Model (CNM), which aims at solving container networking challenges similar to CNI but with some differences. Due to these differences, CNI plugins don't natively integrate with Docker
You'll need to work around it yourself if you want to run Docker containers with network plugins that follow CNI, for example by creating a Docker container without any network configuration and then manually invoking the bridge plugin yourself. That is pretty much how Kubernetes does it.
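A rough sketch of that workaround (the container name is made up, and the plugin and config paths are the defaults shown later in these notes):

```
# Start a container with no networking at all
$ docker run -d --name web --network=none nginx:alpine

# Find the container's network namespace via its PID
$ pid=$(docker inspect -f '{{.State.Pid}}' web)

# Invoke the CNI bridge plugin manually: the network config is passed on stdin,
# everything else through CNI_* environment variables
$ sudo CNI_COMMAND=ADD \
       CNI_CONTAINERID=web \
       CNI_NETNS=/proc/$pid/ns/net \
       CNI_IFNAME=eth0 \
       CNI_PATH=/opt/cni/bin \
       /opt/cni/bin/bridge < /etc/cni/net.d/10-bridge.conf
```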
Cluster Node Networking
Each node in the cluster must have at least one interface with an address configured. The hosts must each have a unique hostname and a unique MAC address.
Some ports used by the control plane components also need to be opened:
- 2379 on master nodes for etcd clients
- 2380 on multiple master nodes for etcd peer-to-peer communication
- 6443 on master node for kube-api
- 10250 on master/worker nodes for kubelet
- 10251 on master node for kube-scheduler
- 10252 on master node for kube-controller-manager
- 30000-32767 on worker nodes for NodePort services
Ref: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#check-required-ports
Pod Layer Networking
Kubernetes does not come with a built-in solution for POD networking; it expects you to implement a networking solution that fits clearly laid out requirements, such as:
- every POD should have its own unique IP
- every POD should be able to reach every other POD on the same node
- every POD should be able to reach every other POD across the nodes in the cluster
A network configuration script should meet the CNI standard; it is then executed by kubelet when starting a container
kubelet
↓
--cni-conf-dir=/etc/cni/net.d
↓
--cni-bin-dir=/opt/cni/bin
↓
./net-script.sh add
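As a rough sketch (this is not a real plugin; the bridge name, address, and argument convention are assumptions), the add step of such a script might look like:

```
#!/bin/bash
# net-script.sh — illustrative only
# Usage: ./net-script.sh add <container> <namespace>

if [ "$1" == "add" ]; then
    container=$2
    namespace=$3

    # Create a veth pair: one end for the bridge, one for the container
    ip link add veth-$container type veth peer name veth-$container-br

    # Attach the bridge end to the node's bridge and the other end to the container's namespace
    ip link set veth-$container-br master cni0
    ip link set veth-$container netns $namespace

    # Assign an IP (a real plugin would get this from the IPAM section) and bring the links up
    ip -n $namespace addr add 10.244.1.2/24 dev veth-$container
    ip -n $namespace link set veth-$container up
    ip link set veth-$container-br up
fi
```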
CNI in Kubernetes
The CNI plugin is invoked by the component within Kubernetes that is responsible for creating containers: kubelet
Find the CNI binary and configuration directories
```
$ ps -aux | grep kubelet
root  2235  2.9  4.6 746620 94640 ?  Ssl  16:11  0:03 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --netw
```
List the CNI plugins
```
$ ls /opt/cni/bin
bridge   host-local  macvlan  sample   weave-ipam
dhcp     ipvlan      portmap  tuning   weave-net
flannel  loopback    ptp
```
View the configuration file
```
$ ls /etc/cni/net.d
10-bridge.conf

$ cat /etc/cni/net.d/10-bridge.conf
{
    "cniVersion": "0.2.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16",
        "routes": [
            { "dst": "0.0.0.0/0" }
        ]
    }
}
```
The isGateway field defines whether the bridge network interface should get an IP address assigned so it can act as a gateway.
The ipMasq field defines whether a NAT rule should be added for IP masquerading.
The ipam section defines the IPAM configuration, where you specify the subnet or the range of IP addresses that will be assigned to pods and any necessary routes. The type host-local indicates that the IP addresses are managed locally on this host, unlike a DHCP server maintaining them remotely. The type can also be set to dhcp to use an external DHCP server.
Weave Net
When the weave CNI plugin is deployed on a cluster, it deploys an agent or service on each node. They communicate with each other to exchange information regarding the nodes and networks and PODs within them.
Each agent or peer stores a topology of the entire setup, that way they know the pods and their IPs on the other nodes.
Weave creates its own bridge on the nodes and names it weave. Then assigns IP address to each network.
What path a packet takes to reach destination depends on the route configured on the container. Weave makes sure that PODs get the correct route configured to reach the agent. And the agent then takes care of other PODs.
When a packet is sent from one pod to another on another node, weave intercepts the packet and identifies that it’s on a separate network. Weave then encapsulates this packet into a new one with new source and destination and sends it across the network. On the other side, the other weave agent retrieves the packet, decapsulates and routes the packet to the right POD.
The easier way to deploy Weave is as a DaemonSet in the cluster; the weave POD then runs on all the nodes:
```
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
role.rbac.authorization.k8s.io/weave-net created
rolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.extensions/weave-net created
```
Inspect the configuration file
```
$ cat /etc/cni/net.d/10-weave.conf
{
    "name": "weave",
    "type": "weave-net",
    "hairpinMode": true
}
```
Inspect the weave bridge interface
```
$ ip link | grep weave
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
13: vethwepldab32dc@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
15: vethweplcf123c9@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
```
Inspect the IP address range the weave network uses
```
$ ip addr show weave
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP group default qlen 1000
    link/ether ce:13:79:1a:ed:6c brd ff:ff:ff:ff:ff:ff
    inet 10.44.0.0/12 brd 10.47.255.255 scope global weave
       valid_lft forever preferred_lft forever
    inet6 fe80::cc13:79ff:fe1a:ed6c/64 scope link
       valid_lft forever
```
IP Address Management (IPAM)
- DHCP
- static
- host-local: stores the state locally on the host filesystem, therefore ensuring uniqueness of IP addresses on a single host
- others, such as Weave's own IPAM
By default, Weave allocates the IP range 10.32.0.0/12 for the entire network. That gives the network IPs from 10.32.0.1 to 10.47.255.254 to use for PODs, and the Weave peers split the range equally, assigning one portion to each node
Service Networking
A Service is a cluster-wide concept. There is no server or process actually listening on the IP of the service, and there are no processes, namespaces, or interfaces for a service; it is just a virtual object
When we create a service object in Kubernetes, it is assigned an IP address from a pre-defined range. The kube-proxy component running on each node creates forwarding rules for the service IP, forwarding traffic from the service IP to the POD IP
The kube-proxy supports different ways to create these rules (proxy-mode)
- userspace
- ipvs
- iptables (default)
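In the default iptables mode, you can inspect the rules kube-proxy programs: traffic to a service's cluster IP is matched in the KUBE-SERVICES chain and DNATed to one of the POD IPs via a per-service KUBE-SVC-* chain (the service name and chain hash below are placeholders):

```
# Find the jump rule for a particular service, then follow its KUBE-SVC-* chain
$ iptables -t nat -L KUBE-SERVICES -n | grep <service name>
$ iptables -t nat -L KUBE-SVC-<hash> -n
```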
Inspect the service IP range
```
$ ps -aux | grep kube-api
root  1943  3.8 15.4 399512 316232 ?  Ssl  12:08  1:02 kube-apiserver --service-cluster-ip-range=10.96.0.0/12 ...
```
Inspect the proxy mode
```
$ kubectl logs kube-proxy-ffncj -n kube-system
W0105 12:09:14.662118       1 server_others.go:287] Flag proxy-mode="" unknown, assuming iptables proxy
I0105 12:09:14.663302       1 server_others.go:140] Using iptables Proxier.
W0105 12:09:14.698727       1 proxier.go:311] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0105 12:09:14.698840       1 server_others.go:174] Tearing down inactive rules.
I0105 12:09:14.808097       1 server.go:448] Version: v1.11.3
I0105 12:09:14.832692       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0105 12:09:14.833060       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0105 12:09:14.833502       1 conntrack.go:83] Setting conntrack hashsize to 32768
I0105 12:09:14.833715       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0105 12:09:14.833768       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0105 12:09:14.833966       1 config.go:102] Starting endpoints config controller
I0105 12:09:14.833995       1 controller_utils.go:1025] Waiting for caches to sync for endpoints config controller
I0105 12:09:14.834299       1 config.go:202] Starting service config controller
I0105 12:09:14.834328       1 controller_utils.go:1025] Waiting for caches to sync for service config controller
I0105 12:09:14.934171       1 controller_utils.go:1032] Caches are synced for endpoints config controller
I0105 12:09:14.934523       1 controller_utils.go:103
```
Cluster DNS Resolution
Kubernetes deploys a built-in DNS server by default (you need to do it yourself if you set up Kubernetes manually). When a service is created, the Kubernetes DNS creates a record for the service, mapping the service name to its IP address
For each namespace the DNS server creates a subdomain. All the services are grouped together into another subdomain called svc
All the services and PODs are grouped together into a root domain for the cluster, which is set to cluster.local by default
For example, curl a web-service in the namespace apps:
```
curl https://web-service
curl https://web-service.apps
curl https://web-service.apps.svc

# FQDN
curl https://web-service.apps.svc.cluster.local
```
Records for PODs are not created by default, but we can enable that explicitly. Once enabled, records are created for PODs as well. The record does not use the POD name; it generates a name by replacing the dots in the IP address with dashes
| Hostname    | Namespace | Type | Root          | IP Address    |
|-------------|-----------|------|---------------|---------------|
| web-service | apps      | svc  | cluster.local | 10.107.37.188 |
| 10-244-2-5  | apps      | pod  | cluster.local | 10.244.2.5    |
This is roughly how Kubernetes does it, except that it does not create entries mapping POD names to IP addresses; it only does that for services
Prior to version 1.12, the DNS server implemented by Kubernetes was known as kube-dns. From version 1.12 onward, the recommended DNS server is CoreDNS
CoreDNS is deployed as PODs (a Deployment with 2 replicas) in the cluster, each running the CoreDNS executable.
```
$ kubectl exec coredns-78fcdf6894-clgxn -n kube-system -- ps
PID   USER     TIME   COMMAND
    1 root       0:03 /coredns -conf /etc/coredns/Corefile
   21 root

$ kubectl exec coredns-78fcdf6894-clgxn -n kube-system -- cat /etc/coredns/Corefile
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    reload
}
```
This config file is passed to the POD as a ConfigMap object
```
$ kubectl get configmap -n kube-system
```
When we deploy the CoreDNS solution, it also creates a service to make it available to other components within the cluster. The service is named kube-dns by default. The kubelet is responsible for configuring the PODs' nameserver to point at this DNS server IP
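You can verify this from inside any POD; the nameserver should point at the kube-dns service IP (the POD name is a placeholder, and 10.96.0.10 is just a common default):

```
$ kubectl exec <pod name> -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
```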
If you manually look up the service using nslookup or the host command, it returns the FQDN of the service. The short name resolves because the resolv.conf file also has a search entry, which is set to default.svc.cluster.local, svc.cluster.local and cluster.local. Notice that there are only search entries for services; for a POD you need to specify its FQDN
```
$ host 10-244-2-5.default.pod.cluster.local
```
Ingress
Service node ports can only be allocated from the high-numbered range (30000-32767)
If using a public cloud, we can use the service type LoadBalancer to create a proxy for the service. Kubernetes then sends an additional request to the cloud provider to provision a network load balancer, configured to route traffic to the service ports
When you run multiple applications, you need yet another proxy to redirect traffic to the multiple load balancers, which increases complexity since you have to configure firewall rules and SSL for each application
That's where Ingress comes in. Ingress is a layer 7 load balancer built into the cluster. It lets users access applications through a single externally accessible URL that can be configured to route to different services within the cluster based on the URL path, and it can implement SSL as well
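As a sketch, an Ingress resource with path-based routing might look like this (the host, service names, and ports are made up, and the apiVersion differs on older clusters):

```
$ kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
  - host: my-app.example.com
    http:
      paths:
      - path: /web
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
EOF
```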
There are several ingress controller solutions available:
- GCE
- Nginx
- HAProxy
- Traefik
- Istio
GCE and Nginx are currently supported and maintained by the Kubernetes project. An Ingress Controller is not just a web server; it has additional intelligence to monitor the Kubernetes cluster for new ingress resource definitions
To manually configure an nginx ingress controller, you will need to create these objects:
- Deployment: expose container port 80 and 443
- Service: expose node port 80 and 443
- ConfigMap: store nginx configuration files
- ServiceAccount, Role, ClusterRole, RoleBinding: for permissions to re-configure at resource changes
You might also want to deploy an additional service used as the default backend: traffic that does not match any rule is directed to this service by default
See the full configuration in the nginx-ingress Helm chart
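For instance, something along these lines installs the community-maintained chart (the repo URL, chart, and release names are assumptions; check the chart's documentation for current values):

```
$ helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
$ helm install ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx --create-namespace
```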
Storage
Persistent Volume
Besides storing a volume on the host, Kubernetes supports several types of storage solutions:
- NFS
- GlusterFS
- Flocker
- Ceph
- ScaleIO
- AWS EBS
- Azure Disk
- Google Persistent Disk
Persistent Volume Claim
Administrators create PVs, and users create PVCs to use them. During the binding process, Kubernetes tries to find a PV that has sufficient capacity as requested by the claim and that satisfies any other requested properties such as access modes, volume modes, storage class and selector
Note that a smaller claim may get bound to a larger volume if all the other criteria match and there are no better options
There is a one-to-one relationship between PV and PVC; no other claims can utilize the remaining capacity in the volume
Configure the field persistentVolumeReclaimPolicy to define what action to perform on the PV after its PVC is deleted:
- Retain (default)
- Delete
- Recycle
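To tie these pieces together, here is a minimal sketch of a PV and a PVC that would bind to it (the hostPath backend, names, and sizes are only for illustration):

```
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-example
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/pv-example
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
EOF
```

The 500Mi claim can bind to the 1Gi volume here since no better match exists, and Retain keeps the volume (and its data) around after the claim is deleted.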