Use MapReduce to find links and generate backlinks
Calculate page rank to build the Google index
Later, it was also used as the backend for Gmail, Google Analytics, Google Earth, etc.
Hadoop HDFS
Inspired by Google's distributed file system, GFS
Timeline
Since 2009, many companies (Yahoo, Facebook, eBay, etc.) have chosen HBase for large-scale production use cases
In 2015, Google announced Bigtable with HBase 1.0-compatible API support for its Compute Engine users
2018, HBase 2.0.0
2021, HBase 3.0.0-alpha-1
Although HBase is bucketed into the NoSQL category of data stores, some interesting projects are moving NoSQL back toward SQL by using HBase as the storage engine for SQL-compliant OLTP database systems.
Use case
HBase’s strengths are its ability to scale and sustain high write throughputs
Many HBase apps are:
Ports from RDBMS to HBase
New low-latency big data apps
How to port an RDBMS to HBase?
Many RDBMSs are painful to scale
Scaling up is no longer practical for massive data
Data inconsistency is not acceptable when scaling reads
Operations get more complicated as the number of replicas increases
Operational techniques are not sufficient when scaling writes
To make it easier to scale, we need to give up some fundamental features that an RDBMS provides, such as:
text search (LIKE)
joins
foreign keys and constraint checks
Change the schema so that it contains only denormalized tables; that way, sharding the RDBMS does not incur replication I/O
At this point, porting the RDBMS to HBase is relatively straightforward
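A minimal sketch of what "only denormalized tables" can look like, assuming two hypothetical normalized tables (`users` and `orders`) and a made-up row key and column naming scheme. It is plain Python, not the HBase API; it only shows how a join and a foreign key collapse into one wide row per entity.

```python
# Hypothetical normalized tables, as an RDBMS might hold them.
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "book"},
    {"order_id": 101, "user_id": 1, "item": "pen"},
    {"order_id": 102, "user_id": 2, "item": "lamp"},
]


def denormalize(users, orders):
    """Build one wide row per user so a read needs no join or foreign-key lookup."""
    rows = {}
    for user_id, user in users.items():
        # Assumed row key format: fixed-width id so keys sort in a stable order.
        row_key = f"user#{user_id:08d}"
        rows[row_key] = {"info:name": user["name"]}
    for order in orders:
        row_key = f"user#{order['user_id']:08d}"
        # Each order becomes a column of the owning user's row.
        rows[row_key][f"orders:{order['order_id']}"] = order["item"]
    return rows


rows = denormalize(users, orders)
print(rows["user#00000001"])
```

Everything about one user now lives under a single key, which is the shape that shards cleanly.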
Why choose HBase instead?
When your apps need high write and read throughput
When you are tired of the RDBMS's fragile scaling operations
Data Volumes
Entity data: information about the current state of a particular person or thing
Event data (or time series data): records of events that are generally spaced over many time intervals
Data volume explodes when we need both of them
HBase or Not
Q: Does your app expect new data to be available immediately after an update?
Yes: Use HBase
When data is queried, it must reflect the most recent values
Query responses are expected in milliseconds
No: No need for HBase
Q: Is your app analytical or operational?
Analytical: Not optimal for HBase
Looks at large sets of data
Often filters for a particular time range
Better to choose Hadoop
Operational: Use HBase
Looks up a single entity or a small set of entities
Q: Does your app expect updates to be visible immediately after they are made?
Yes: Use HBase
Frequently modified
Pinpoint deletes
Updates must be reflected within milliseconds
No: No need for HBase
Data is append-only
Deletes in bulk or never
Updates can be ignored until the next report is run
Comparison

Workload     | HBase                                                      | Hadoop
Low latency  | 1 ms from cache, 10 ms from disk                           | 1 min via MR/Spark, 1 s via Impala
Random read  | Rowkey is the primary index                                | The small-file problem
Short scan   | Sorted and efficient                                       | Bespoke partitioning can help
Full scan    | Possible but non-optimal; improved perf w/ MR on snapshots |
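The "rowkey is the primary index" and "sorted and efficient" rows can be illustrated with a toy model. The sketch below assumes a hypothetical in-memory table kept sorted by row key (`SortedRows` and the `sensor#date` key format are made up for illustration, and this is not the HBase client API). A point read is a binary search, and a short scan is a contiguous slice of the sorted keys.

```python
import bisect


class SortedRows:
    """Toy stand-in for a table kept sorted by row key (not the HBase API)."""

    def __init__(self, rows):
        self._keys = sorted(rows)
        self._rows = dict(rows)

    def get(self, row_key):
        # Random read: binary search on the sorted key list.
        i = bisect.bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            return self._rows[row_key]
        return None

    def scan(self, start_key, stop_key):
        """Yield (key, value) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = SortedRows({
    "sensor1#2024-01-01": 20.5,
    "sensor1#2024-01-02": 21.0,
    "sensor2#2024-01-01": 18.2,
    "sensor1#2024-01-03": 19.8,
})

# Short scan over one entity: "$" is the ASCII character right after "#",
# so this range covers every key that starts with "sensor1#".
short_scan = list(table.scan("sensor1#", "sensor1$"))
print(short_scan)
```

Because related keys sit next to each other, the scan touches only the rows it returns, which is why row key design matters so much for this kind of workload.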
A gateway (in network terms) is a router that provides connectivity between networks
# For Network A
$ ip route add 192.168.2.0/24 via 192.168.1.1
# For Network B
$ ip route add 192.168.1.0/24 via 192.168.2.1
Default Gateway
If none of the forwarding rules in the routing table matches a given destination address, the default gateway is used as the router of last resort
$ ip route show default
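The lookup described above can be sketched as a small model: pick the most specific matching route, and fall back to the default route (0.0.0.0/0) when nothing else matches. The table contents are assumptions for illustration; in particular, the default gateway address 192.168.1.254 is hypothetical and does not come from the notes.

```python
import ipaddress

# Hypothetical routing table: (destination network, next hop).
ROUTES = [
    (ipaddress.ip_network("192.168.2.0/24"), "192.168.1.1"),
    # Default route; the gateway address here is an assumed example value.
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.254"),
]


def next_hop(destination, routes=ROUTES):
    """Return the next hop for a destination, or None if no route matches."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # The most specific route (longest prefix) wins; 0.0.0.0/0 matches
    # everything, so it is only chosen when nothing else does.
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


print(next_hop("192.168.2.10"))
print(next_hop("8.8.8.8"))
```

Without the 0.0.0.0/0 entry, `next_hop("8.8.8.8")` returns `None`, which corresponds to a "network is unreachable" error on a real host.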
Forwarding packets between interfaces
By default on Linux, packets are not forwarded from one interface to the next, for security reasons
Explicitly allow it
$ echo 1 > /proc/sys/net/ipv4/ip_forward
Persist the setting across reboots (typically in /etc/sysctl.conf)
net.ipv4.ip_forward=1
DNS
Host names can be translated to IP addresses by adding entries to /etc/hosts
When an environment has too many entries and IP addresses are not persistent, we need a DNS server
$ cat /etc/resolv.conf
nameserver 192.168.1.100
The host looks up an entry in /etc/hosts first, then in DNS. This order can be changed in the configuration file /etc/nsswitch.conf
$ cat /etc/nsswitch.conf
passwd: files
group: files
shadow: files
gshadow: files
hosts: files dns
networks: files
protocols: db files
services: db files
ethers: db files
rpc: db files
netgroup: nis
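To make the effect of the `hosts:` line concrete, here is a small sketch that reads the lookup order from nsswitch.conf-style text and applies it to two in-memory sources. The sample addresses and the `resolve` helper are assumptions for illustration. The parser is simplified: it splits on whitespace and does not interpret action items such as `[NOTFOUND=return]`.

```python
NSSWITCH_SAMPLE = """\
passwd: files
hosts: files dns
networks: files
"""


def hosts_lookup_order(nsswitch_text):
    """Return the sources listed on the hosts: line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.startswith("hosts:"):
            return line.split(":", 1)[1].split()
    return []


def resolve(name, order, hosts_file, dns_records):
    """Try each source in order; return (address, source) or None."""
    sources = {"files": hosts_file, "dns": dns_records}
    for source in order:
        address = sources.get(source, {}).get(name)
        if address is not None:
            return address, source
    return None


order = hosts_lookup_order(NSSWITCH_SAMPLE)
# Hypothetical data: the same name exists in /etc/hosts and in DNS.
hosts_file = {"web": "192.168.1.10"}
dns_records = {"web": "192.168.1.99", "db": "192.168.1.20"}
print(order, resolve("web", order, hosts_file, dns_records))
```

With `hosts: dns files` the same lookup would return the DNS answer instead, which is the only thing that line controls.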
You can configure the DNS server to forward unknown host names to a public name server on the Internet, for example to reach www.google.com
private DNS → Root DNS → .com DNS → google DNS → cache the result
When looking for a host in the same domain, we want to simply use the host name and not the full name, such as web instead of web.mycompany.com. To do that, specify the domain name to append in /etc/resolv.conf
$ cat /etc/resolv.conf
search mycompany.com
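A simplified sketch of how a `search` entry expands a name. The `candidate_names` function is hypothetical. It approximates the usual resolver behaviour under the default `ndots:1` option, where a name without a dot is tried with the search domains first, and a name that already contains a dot is tried as-is first.

```python
def candidate_names(name, search_domains):
    """Names a resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if "." in name:
        # Looks qualified already: try it as-is, then with the search domains.
        return [name] + expanded
    # Short name: try the search domains first, then the bare name.
    return expanded + [name]


print(candidate_names("web", ["mycompany.com"]))
print(candidate_names("web.mycompany.com", ["mycompany.com"]))
```

This is why `ping web` works inside the domain while the full name still resolves directly.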
Records stored in DNS have specific types:
A: IPv4 address
AAAA: IPv6 address
CNAME: name-to-name mapping
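As a rough illustration of how these record types fit together, the sketch below models a tiny record store and follows CNAME aliases until it finds a record of the requested type. All names and addresses are made-up examples, and `lookup` is a hypothetical helper, not a real DNS client.

```python
# Hypothetical zone data keyed by (name, record type).
RECORDS = {
    ("web.mycompany.com", "A"): "192.168.1.10",
    ("web.mycompany.com", "AAAA"): "2001:db8::10",
    ("www.mycompany.com", "CNAME"): "web.mycompany.com",
}


def lookup(name, rtype, records=RECORDS, max_hops=8):
    """Return the record value for (name, rtype), following CNAME aliases."""
    for _ in range(max_hops):  # bound the chain to avoid alias loops
        if (name, rtype) in records:
            return records[(name, rtype)]
        alias = records.get((name, "CNAME"))
        if alias is None:
            return None
        name = alias
    return None


print(lookup("www.mycompany.com", "A"))
print(lookup("www.mycompany.com", "AAAA"))
```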
You can use tools like nslookup and dig to debug. Note that nslookup (like dig) queries only the DNS server and does not consult /etc/hosts
There are plenty of DNS solutions, such as CoreDNS. Besides configuring entries from files, CoreDNS supports other ways of configuring DNS entries through plugins, such as the kubernetes plugin
Network Namespace
A namespace is a way of scoping a particular set of identifiers
Linux provides namespaces for networking and for processes. If a process is running within a process namespace, it can only see and communicate with other processes in the same namespace
Linux starts up with a default network namespace
Each network namespace has its own routing table and its own set of iptables rules
# Create namespaces
ip netns add red
ip netns add blue
# List namespaces
ip netns list
# List interfaces on the host
ip link
# List interfaces inside a namespace
ip netns exec red ip link
# or, equivalently
ip -n red link
Connect namespaces together using a virtual Ethernet pair (or virtual cable, pipe)
# Create veth pair
$ ip link add veth-red type veth peer name veth-blue
# Attach each interface to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue
# Assign an IP to the interface in each namespace (with a /24 prefix so the peer is on-link)
$ ip -n red addr add 192.168.15.1/24 dev veth-red
$ ip -n blue addr add 192.168.15.2/24 dev veth-blue
# Bring up the interface for each device within the respective namespace
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up
# List the ARP table to see the neighbor
$ ip netns exec red arp
# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
When more namespaces need to be connected, use a virtual switch to create a virtual network. There are a few solutions:
Linux Bridge
Open vSwitch
# Create a virtual switch (bridge) interface
$ ip link add v-net-0 type bridge
# Bring the interface up
$ ip link set dev v-net-0 up
# Create cables for each namespace to connect to the bridge
# (if veth-red/veth-blue still exist from the direct link, delete them first: ip -n red link del veth-red)
$ ip link add veth-red type veth peer name veth-red-br
$ ip link add veth-blue type veth peer name veth-blue-br
# Attach one end to the appropriate namespace
$ ip link set veth-red netns red
$ ip link set veth-blue netns blue
# Attach the other end to the bridge
$ ip link set veth-red-br master v-net-0
$ ip link set veth-blue-br master v-net-0
# Assign an IP to the interface in each namespace (with a /24 prefix so peers are on-link)
$ ip -n red addr add 192.168.15.1/24 dev veth-red
$ ip -n blue addr add 192.168.15.2/24 dev veth-blue
# Bring up the interface within each namespace, and the bridge-side ends on the host
$ ip -n red link set veth-red up
$ ip -n blue link set veth-blue up
$ ip link set veth-red-br up
$ ip link set veth-blue-br up
# Assign an IP address to the bridge (since it's just another interface on the host)
$ ip addr add 192.168.15.3/24 dev v-net-0
# Ping across namespaces
$ ip netns exec red ping 192.168.15.2
When a private virtual network needs to reach an outside network, it needs a gateway. Here the host is the gateway
$ ip netns exec red ip route add 192.168.1.0/24 via 192.168.15.3
For the destination network to be able to respond, enable NAT on the host acting as the gateway.
Add a new rule to the POSTROUTING chain of the iptables NAT table to masquerade, that is, to replace the source address on all packets coming from the source network 192.168.15.0/24 with the host's own IP address.
Thus anyone receiving these packets outside the network will think that they are coming from the host and not from within the namespaces
$ iptables -t nat -A POSTROUTING -s 192.168.15.0/24 -j MASQUERADE
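What the masquerade rule does to a packet can be sketched with a toy model. The `Packet` type, the `masquerade` function and the host address 192.168.1.2 are assumptions for illustration. Real NAT also tracks connections so that replies can be translated back, which this sketch leaves out.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def masquerade(packet, source_network, host_address):
    """Rewrite the source address of packets leaving the private network."""
    if ipaddress.ip_address(packet.src) in ipaddress.ip_network(source_network):
        return replace(packet, src=host_address)
    return packet  # packets from other sources are left untouched


outgoing = Packet(src="192.168.15.1", dst="192.168.1.3")
# 192.168.1.2 is a hypothetical address for the host's outward-facing interface.
translated = masquerade(outgoing, "192.168.15.0/24", "192.168.1.2")
print(translated)
```

The receiver only ever sees the host's address, so it can answer without knowing a route to 192.168.15.0/24.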
Add a default route so that the namespace reaches the outside world through the gateway
$ ip netns exec red ip route add default via 192.168.15.3
For the outside world to reach a namespace in the private network, add a port forwarding rule using iptables that says any traffic coming to port 80 on the host is to be forwarded to port 80 on the IP assigned to the namespace
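The forwarding decision described above can be modelled as a lookup table from a host port to an endpoint inside the private network. This is only a sketch of the idea, not an iptables rule. The `Endpoint` type, the `forward` helper and the host address 192.168.1.2 are hypothetical, and the target 192.168.15.2 is assumed to be the namespace serving port 80.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    address: str
    port: int


# Assumed mapping: host port 80 goes to port 80 on the blue namespace.
FORWARDS = {80: Endpoint("192.168.15.2", 80)}


def forward(dst_port, host_address, forwards=FORWARDS):
    """Return where traffic arriving on the host at dst_port is delivered."""
    # Ports without a forwarding rule stay on the host itself.
    return forwards.get(dst_port, Endpoint(host_address, dst_port))


print(forward(80, "192.168.1.2"))
print(forward(22, "192.168.1.2"))
```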