Use MapReduce to find links and generate backlinks
Calculate page rank to build the Google index
Later, it was also used as the backend for Gmail, GA (Google Analytics), Google Earth, etc.
Hadoop HDFS
Inspired by Google's distributed file system, GFS
Timeline
Since 2009, many companies (Yahoo, Facebook, eBay, etc.) have chosen HBase for large-scale production use cases
In 2015, Google announced Bigtable with HBase 1.0 compatible API support for its Compute Engine users
2017, HBase 2.0.0
2020, HBase 3.0.0
Despite being bucketed into the NoSQL category of data storage, some interesting projects are moving NoSQL back to SQL by using HBase as the storage engine for a SQL-compliant OLTP database system.
Use case
HBase’s strengths are its ability to scale and sustain high write throughputs
Many HBase apps are:
Ports from RDBMS to HBase
New low-latency big data apps
How to port an RDBMS to HBase?
Many RDBMS are painful to scale
Scaling up is no longer practical for massive data
Data inconsistency was not acceptable when scaling reads
Operationally gets more complicated as the number of replicas increases
Operational techniques not sufficient when scaling writes
To make it easier to scale, we need to discard some fundamental features that an RDBMS provides, such as:
text search (LIKE)
joins
foreign keys and constraint checks
Change the schema so it contains only denormalized tables; then we won't incur replication IO when sharding the RDBMS
Now porting the RDBMS to HBase is relatively straightforward
Why choose HBase instead?
When your apps need high write and read throughput
When you're tired of the RDBMS's fragile scaling operations
Data Volumes
Entity data: information about the current state of a particular person or thing
Event data (or time-series data): records events that are generally spaced over many time intervals
Data volume explodes when we need both of them
HBase or Not
Q: Does your app expect new data to be available immediately after an update?
Yes: Use HBase
When data queried, must reflect the most recent values
Expect query responses in milliseconds
No: No need for HBase
Q: Is your app analytical or operational?
Analytical: Not optimal for HBase
Looks at large sets of data
Often filters for a particular time range
Better to choose Hadoop
Operational: Use HBase
Looks for a single entity or a small set of entities
Q: Does your app expect updates to be available immediately after an update?
Yes: Use HBase
Frequently modified
Pinpoint deletes
Updates must be reflected within milliseconds
No: No need for HBase
Data is append-only
Deletes in bulk or never
Updates can be ignored until the next report is run
Comparison

| Workload | HBase | Hadoop |
| --- | --- | --- |
| Low latency | 1 ms from cache, 10 ms from disk | 1 min via MR/Spark, 1 s via Impala |
| Random read | Rowkey is the primary index | The small file problem |
| Short scan | Sorted and efficient | Bespoke partitioning can help |
| Full scan | Possible but non-optimal; improved perf with MR on snapshots | |
Besides storing a volume on the host, Kubernetes provides several types of storage solutions:
NFS
GlusterFS
Flocker
Ceph
ScaleIO
AWS EBS
Azure Disk
Google Persistent Disk
Persistent Volume Claim
Administrators create PVs, and users create PVCs to use them. During the binding process, Kubernetes tries to find a PV that has sufficient capacity as requested by the claim and matches any other requested properties such as access modes, volume modes, storage class, and selector
Note that a smaller claim may get bound to a larger volume if all the other criteria match and there are no better options
There is a one-to-one relationship between a PV and a PVC; no other claim can utilize the remaining capacity in the volume
Configure the field persistentVolumeReclaimPolicy to define what action to perform on the PV after its PVC is deleted (see the sketch after this list):
Retain (default)
Delete
Recycle
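A minimal sketch of the flow above; the NFS server, path, sizes, and names are hypothetical:
# A PV backed by NFS with a Retain reclaim policy, and a smaller claim
# that may still bind to it if no better match exists
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-data
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs-server.example.com
    path: /exports/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
# Check the binding
kubectl get pv,pvc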
Networking
Networking for Linux Basics
Network Switch
A switch is a device in a computer network that connects other devices together; it can only enable communication within a network
A gateway (in network terms) is a router that provides connectivity between networks
# For Network A
$ ip route add 192.168.2.0/24 via 192.168.1.1
# For Network B
$ ip route add 192.168.1.0/24 via 192.168.2.1
Default Gateway
If none of the forwarding rules in the routing table is appropriate for a given destination address, the default gateway is chosen as the router of last resort
$ ip route show default
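To add one, for example on a host in Network A, assuming its router 192.168.1.1 from the example above is also its path to the outside world:
$ ip route add default via 192.168.1.1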
Forwarding packets between interfaces
By default in Linux, packets are not forwarded from one interface to the next, for security reasons
Explicitly allow it:
echo 1 > /proc/sys/net/ipv4/ip_forward
Persist the setting (in /etc/sysctl.conf):
net.ipv4.ip_forward=1
DNS
Translate host names to IP addresses by configuring /etc/hosts
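For example, with hypothetical names and addresses:
$ cat /etc/hosts
192.168.1.10    web
192.168.1.11    db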
When an environment has too many entries and IP addresses are not persistent, we need a DNS server
$ cat /etc/resolv.conf
nameserver 192.168.1.100
The host will look up an entry in /etc/hosts first, then query DNS. This order can be changed by configuring /etc/nsswitch.conf
$ cat /etc/nsswitch.conf
passwd:     files
group:      files
shadow:     files
gshadow:    files
hosts:      files dns
networks:   files
protocols:  db files
services:   db files
ethers:     db files
rpc:        db files
netgroup:   nis
You can configure the DNS server to forward unknown host names to a public name server on the Internet, for example to reach www.google.com
private DNS → Root DNS → .com DNS → google DNS → cache the result
When looking for a host in the same domain, we want to simply use the host name rather than the full name, e.g. web instead of web.mycompany.com; therefore we specify the domain name to append in /etc/resolv.conf
$ cat /etc/resolv.conf
search mycompany.com
Records are stored in DNS with specific types:
A: IPv4 address
AAAA: IPv6 address
CNAME: name-to-name mapping
You can use tools like nslookup and dig to debug; note that nslookup only queries the DNS server, not /etc/hosts
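For example, querying the name used above with each tool:
$ nslookup www.google.com
$ dig www.google.com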
There are plenty of DNS solutions, such as CoreDNS; besides configuring from files, CoreDNS supports other ways of configuring DNS entries through plugins, such as the kubernetes plugin
Network Namespace
A namespace is a way of scoping a particular set of identifiers
Linux provides namespaces for networking and processes. If a process is running within a process namespace, it can only see and communicate with other processes in the same namespace
Linux starts up with a default network namespace
Each network namespace has its own routing table and its own set of iptables rules
# Create namespace
ip netns add red
# List namespace
ip netns list
# List interface
ip link
# List interface in namespace
ip netns exec red ip link
# or
ip -n red link
Connect namespaces together using a virtual Ethernet pair (or virtual cable, pipe)
# Create a veth pair
$ ip link add veth-red type veth peer name veth-blue
# Attach each interface to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue
# Assign an IP to each namespace
$ ip -n red addr add 192.168.15.1 dev veth-red
$ ip -n blue addr add 192.168.15.2 dev veth-blue
# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up
# List the ARP table to see neighbors
$ ip netns exec red arp
# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
When more namespaces need to be connected, use a virtual switch to create a virtual network. There are a few solutions:
Linux Bridge
Open vSwitch
# Create a virtual switch interface (bridge)
ip link add v-net-0 type bridge
# Bring the interface up
ip link set dev v-net-0 up
# Delete the previous cable and create new cables to connect to the bridge
ip link add veth-red type veth peer name veth-red-br
ip link add veth-blue type veth peer name veth-blue-br
# Attach one end to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue
# Attach the other end to the bridge
$ ip link set veth-red-br master v-net-0
$ ip link set veth-blue-br master v-net-0
# Assign an IP to each namespace
$ ip -n red addr add 192.168.15.1 dev veth-red
$ ip -n blue addr add 192.168.15.2 dev veth-blue
# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up
# Assign an IP address to the bridge (since it's just another interface on the host)
ip addr add 192.168.15.3/24 dev v-net-0
# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
When a private virtual network needs to reach the outside network, it needs a gateway; the host is the gateway
$ ip netns exec red ip route add 192.168.1.0/24 via 192.168.15.3
For the destination network to respond, enable NAT on the host acting as the gateway.
Add a new rule to the NAT table in the POSTROUTING chain to masquerade, i.e. replace the source address on all packets coming from the source network 192.168.15.0/24 with the host's own IP address.
Thus anyone receiving these packets outside the network will think that they are coming from the host and not from within the namespaces
$ iptables -t nat -A POSTROUTING -s 192.168.15.0/24 -j MASQUERADE
Add a default route via the host to reach the outside world
$ ip netns exec red ip route add default via 192.168.15.3
For the outside world to reach a namespace in the private network, add a port-forwarding rule using iptables saying that any traffic coming to port 80 on the host is to be forwarded to port 80 on the IP assigned to the namespace, as shown below
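A sketch of such a rule, assuming the red namespace's address 192.168.15.2 from the examples above:
$ iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.15.2:80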
When a node is down for more than 5 minutes (the default), the pods on it are terminated; a pod will be recreated if it belongs to a ReplicaSet
Drain, Cordon, Uncordon
If we're not sure the node will come back online within 5 minutes, we can drain the node.
After the drained node is upgraded and comes back, it is still unschedulable; uncordon the node to make it schedulable again.
Note that the previous pods won't automatically be rescheduled back onto the node.
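A sketch of the commands, with a hypothetical node name:
# Evict the workloads and mark the node unschedulable
kubectl drain node-1 --ignore-daemonsets
# Mark the node unschedulable without evicting existing pods
kubectl cordon node-1
# Make the node schedulable again after maintenance
kubectl uncordon node-1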
Cluster Upgrade
The core control plane components' versions can differ, but they should follow certain rules:
kube-apiserver is the primary component; none of the other components' versions may be higher than kube-apiserver's
the other components can be 1-2 minor versions lower:
kube-apiserver: x
controller-manager, kube-scheduler: x, x-1
kubelet, kube-proxy: x, x-1, x-2
kubectl can be one version higher than kube-apiserver: x+1, x, x-1
Kubernetes supports only the three most recent minor versions. The recommended approach is to upgrade one minor version at a time.
How you update the cluster depends on how you deployed it:
cloud provider: a few clicks in the UI
kubeadm: using the upgrade command (you should upgrade kubeadm itself first!); see the sketch after the upgrade steps below
the hard way from scratch: manually upgrade the components yourself
Two major steps:
upgrade the master node: the control plane components go down and all management functions are unavailable, but the applications deployed on the worker nodes keep serving
upgrade the worker nodes, with one of these strategies:
upgrade all at once, with downtime
upgrade one at a time
create new nodes and move the workloads over, then finally remove the old nodes
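A sketch of the kubeadm flow, assuming an apt-based install and a hypothetical target version 1.x.y (package version strings vary by repository):
# On the master node: upgrade kubeadm first, then the control plane
apt-get update && apt-get install -y kubeadm=1.x.y-00
kubeadm upgrade plan
kubeadm upgrade apply v1.x.y
# Upgrade the kubelet on the master
apt-get install -y kubelet=1.x.y-00
systemctl restart kubelet
# On each worker node (drain it from the master first)
apt-get install -y kubeadm=1.x.y-00
kubeadm upgrade node
apt-get install -y kubelet=1.x.y-00
systemctl restart kubelet
# Back on the master: uncordon the worker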
When you run a command like kubectl get nodes, the VERSION column indicates the version of the kubelet
Backup and Restore
Master / Node DR
Cordon & drain
Provision replacement master / node
ETCD DR
Option: Backup resources
Save a copy of the objects by querying the kube-apiserver
kubectl get all --namespace=default -o yaml > default-deployment-services.yaml
Option: Backup ETCD
Making copies of the ETCD data directory
# etcd.service
ExecStart=/usr/local/bin/etcd \
  --name=${ETCD_NAME} \
  ...
  --data-dir=/var/lib/etcd
Or use the etcd command line tool
Make a snapshot:
ETCDCTL_API=3 etcdctl snapshot save etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/etcd-server.crt \
  --key=/etc/etcd/etcd-server.key
Remember to specify the certificate files for authentication
When ETCD is restored from a backup, it initializes a new cluster configuration and configures the members of ETCD as new members of a new cluster. This is to prevent a new member from accidentally joining an existing cluster.
For example, when using a snapshot to provision a new etcd cluster for testing purposes, you don't want the members of the new test cluster to accidentally join the production cluster.
Configure etcd.service with the new data directory and the new cluster token
During a restore, you must provide a new cluster token and the same initial cluster configuration
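A sketch of the restore with etcdctl; the snapshot file, data directory, and token below are example values:
ETCDCTL_API=3 etcdctl snapshot restore etcd.db \
  --data-dir=/var/lib/etcd-from-backup \
  --initial-cluster-token=etcd-cluster-1
# Then point etcd.service at the new --data-dir and --initial-cluster-token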
Restart ETCD service
systemctl daemon-reload
service etcd restart
Start kube-api
service kube-apiserver start
Persistent Volume DR
You can't rely on Kubernetes for backing up and restoring persistent volumes.
If you're using cloud-provider-specific persistent volumes like EBS volumes, Azure managed disks, or GCE persistent disks, you should use the cloud provider's snapshot APIs
Set the pods' tolerations; three taint effects are available for pods that do not tolerate the taint (see the sketch after this list):
NoSchedule
PreferNoSchedule: not guaranteed
NoExecute: new pods are not scheduled, existing pods are evicted
Note that the values in the toleration fields must use double quotes
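A sketch of tainting a node and tolerating that taint in a pod; the node name and key/value pair are hypothetical:
# Taint the node
kubectl taint nodes node-1 app=blue:NoSchedule
# Pod that tolerates the taint (note the double-quoted values)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
EOF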
Node Selector
Limit the pod to be scheduled onto one kind of node only
Label the node
Set the nodeSelector
Note there are no OR or NOT conditions; use node affinity instead
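A minimal nodeSelector sketch, using a hypothetical size=Large label:
# Label the node
kubectl label nodes node-1 size=Large
# Pod restricted to nodes carrying that label
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    size: Large
EOF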
Node Affinity
Limit the pod to be scheduled onto one or more particular nodes
Label the node
Set the nodeAffinity
operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
3 types: requiredDuringSchedulingIgnoredDuringExecution, preferredDuringSchedulingIgnoredDuringExecution, and the planned requiredDuringSchedulingRequiredDuringExecution
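A node affinity sketch with the same hypothetical size label, using the In operator:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - Large
            - Medium
EOF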
Combine taints/tolerations with nodeSelector or node affinity to cover the scenarios
Resources
Request
Scheduling is based on the resource requests
By default (when such defaults are set via a LimitRange), k8s assumes a pod requires 0.5 CPU and 256Mi of memory
Limit
By default (via a LimitRange), k8s limits a pod to 1 CPU and 512Mi of memory
When a pod tries to consume resources beyond the limit:
CPU: k8s throttles the CPU; the pod is not killed
memory: k8s kills the pod with an OOM error
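A sketch of setting requests and limits on a container; the values mirror the defaults mentioned above:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
EOF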
Static Pods
Used in creating control plane components (e.g., by kubeadm)
Without intervention from the kube-apiserver, the kubelet can manage a node independently by monitoring manifest files in the filesystem; it can create, recreate, update, and delete Pod objects only
--pod-manifest-path=/etc/kubernetes/manifests
--config=kubeconfig.yaml (which sets staticPodPath)
When a static pod is created, the kube-apiserver only gets a read-only mirror object and does not have the ability to update/delete it
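A sketch: dropping a manifest into the static pod path (per the flags above) makes the kubelet create the pod without going through the kube-apiserver:
cat <<EOF > /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
EOF
# The kubelet picks up the file automatically; deleting the file deletes the pod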
Multiple Scheduler
copy the kube-scheduler configs from /etc/kubernetes/manifests
rename the scheduler with --scheduler-name
if one master node runs multiple schedulers:
set --leader-elect=false
if multiple masters run multiple schedulers, only one scheduler instance can be active at a time:
set --leader-elect=true
set --lock-object-name to differentiate the custom scheduler from the default one when there are multiple masters
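A sketch of the relevant flags in the copied manifest plus a pod that opts in to the custom scheduler; the names are hypothetical, and newer releases set the scheduler name through a KubeSchedulerConfiguration file instead of these flags:
# In the copied manifest, in the kube-scheduler command section:
#   - --scheduler-name=my-custom-scheduler
#   - --leader-elect=false
# Pod that asks to be scheduled by the custom scheduler
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: my-custom-scheduler
  containers:
  - name: nginx
    image: nginx
EOF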