S Sivart

HBase Basics

  • Devops

Apache HBase is an open source, scalable, consistent, low latency, random access data store

Source: Infinite Skills

Features

Horizontally Scalable

Linear increase in servers results in linear increases in storage capacity and I/O operations

CAP Trade off

In terms of the CAP theorem, HBase is closest to a CP system

  • Consistency: ACID (atomicity, consistency, isolation, durability) guarantees on rows
  • Availability: Response times of 2-3 ms from cache, 10-20 ms from disk
  • Partition Tolerance: Failures don’t block the system, although it might take longer to respond in order to maintain consistency

Dependencies

Apache ZooKeeper

  • Used for distributed coordination of leaders for high availability
  • Optimized to be highly available for reads
  • Not designed to scale for high write throughput

Apache Hadoop HDFS

  • Provides data durability and reliability
  • Optimized for sequential reads and writes of large files
  • Does not provide random updates, only a simple API for random reads
  • Cannot scale to tens of billions of small entities (less than a few hundred MB each)

Both systems have their strengths, but neither individually provides the same properties as HBase

Random Access

Optimized for small random reads

  • Entities indexed for efficient random reads

Optimized for high throughput random writes

  • Updates without requiring read
  • Random writes via Log Structured Merge (LSM)
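
The write path described above can be illustrated with a toy log-structured merge store. This is only a sketch of the general LSM idea, not HBase's implementation: the class name TinyLSM, the flush_threshold parameter, and the in-memory lists standing in for on-disk files are all assumptions made for illustration.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store: buffer writes in memory, flush sorted runs."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.memstore = {}   # newest writes, held in memory
        self.segments = []   # immutable sorted runs, oldest first

    def put(self, key, value):
        # A write never reads existing data: it only updates the in-memory buffer.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the buffer as one sorted, immutable run (a sequential write).
        if self.memstore:
            self.segments.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Reads check the newest data first, so later writes shadow older ones.
        if key in self.memstore:
            return self.memstore[key]
        for segment in reversed(self.segments):
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None


store = TinyLSM(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")   # reaches the threshold and triggers a flush
store.put("row1", "c")   # overwrites row1 without reading the old value
```

The random write becomes an in-memory update plus an occasional sequential flush, which is the kind of workload HDFS handles well. A real store would also merge (compact) the sorted runs in the background.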

Short History

Inspired by Google’s Bigtable

Bigtable: A Distributed Storage System for Structured Data (2006)

BigTable

Datastore for Google’s Web Crawl Table

  • Store web page content
  • Web URL as key
  • Use MapReduce to find links and generate backlinks
  • Calculate page rank to build the Google index

Later, it was also used as the backend for Gmail, Google Analytics, Google Earth, etc.

Hadoop HDFS

Inspired by Google distributed file system GFS

Timeline

Since 2009, many companies (Yahoo, Facebook, eBay, etc.) have chosen HBase for large-scale production use cases

In 2015, Google announced Bigtable with HBase 1.0 compatible API support for its cloud platform users

2017, HBase 2.0.0

2020, HBase 3.0.0

Despite HBase being bucketed into the NoSQL category of data stores, some interesting projects are moving NoSQL back toward SQL by using HBase as the storage engine for SQL-compliant OLTP database systems.

Use case

HBase’s strengths are its ability to scale and sustain high write throughputs

Many HBase apps are:

  • Ports from RDBMS to HBase
  • New low-latency big data apps

How to port an RDBMS to HBase?

  • Many RDBMSs are painful to scale
  • Scaling up is no longer practical for massive data
  • Data inconsistency is not acceptable when scaling reads
  • Operations get more complicated as the number of replicas increases
  • Operational techniques are not sufficient when scaling writes

To make an RDBMS easier to scale, we have to give up some of the fundamental features it provides, such as:

  • text search (LIKE)
  • joins
  • foreign keys and constraint checks

If we change the schema so that it contains only denormalized tables, we won’t incur replication I/O when sharding the RDBMS

At that point, porting the RDBMS to HBase is relatively straightforward

Why choose HBase instead?

  • When your apps need high write and read throughput
  • When you are tired of the RDBMS’s fragile scaling operations

Data Volumes

  • Entity data: information about the current state of a particular person or thing
  • Event data (or time series data): records events that are generally spaced over many time intervals

Data volume explodes when we need both of them

HBase or Not

Q: Does your app expect new data to be available immediately after an update?

  • Yes: Use HBase
    • When data is queried, it must reflect the most recent values
    • Expect query responses in milliseconds
  • No: No need for HBase

Q: Is your app analytical or operational?

  • Analytical: Not optimal for HBase
    • Looks at large sets of data
    • Often filters for a particular time range
    • Better to choose Hadoop
  • Operational: Use HBase
    • Looks for a single entity or a small set of entities

Q: Does your app need modifications and deletes to be visible immediately after they are made?

  • Yes: Use HBase
    • Frequently modified
    • Pinpoint deletes
    • Updates must be reflected within milliseconds
  • No: No need for HBase
    • Data is append-only
    • Deletes in bulk or never
    • Updates can be ignored until the next report is run

Comparison

Workload    | HBase                                                       | Hadoop
Low Latency | 1 ms from cache, 10 ms from disk                            | 1 min via MR/Spark, 1 s via Impala
Random Read | Rowkey is primary index                                     | The small file problem
Short Scan  | Sorted and efficient                                        | Bespoke partitioning can help
Full Scan   | Possible but non-optimal; improved perf w/ MR on snapshots  | Optimized with MR, Hive, Impala
Updates     | Optimized                                                   | Not supported

I’m Thinking of Ending Things (我想結束這一切)

  • Quote

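I often think that what we know about other people comes not from what they tell us but from what we observe. People can say whatever they like to us. When someone says "nice to meet you," they are actually thinking something else and forming some kind of judgment. "Nice" is never what they really think or feel, but that is what they say, and all we can do is listen.

"It’s strange. There is a chance to see Venus, and most people are watching television. No offense if that was you."

"The meaning of my existence is that life has addressed a question to me. Or, conversely, I myself am a question addressed to the world, and I must communicate my answer, for otherwise I am dependent upon the world’s answer."

A kiss involves two people. You cannot be the best kisser on your own.

"I think that is why so many people rush into marriage and get stuck in bad relationships, at any age: they cannot be at peace with being alone."

We believe in certain constructs that help us make sense of life. Not only to make sense of it, but as a means of comfort. The idea that spending the rest of our lives with someone makes us happier is not an inherent truth of existence. We just want that belief to be true.

Maybe this is how we judge whether a relationship is real: someone who previously had nothing to do with us comes to know us in a way we never imagined, or never thought possible.

Yet on a physical level, he is now completely different from the child in the photograph. He does not just mean that he looks different, but that every cell captured in that image has died, been shed, and been replaced by new cells.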

Kubernetes Short Notes(4)

  • Devops

Storage

Persistent Volume

Besides storing a volume on the host, Kubernetes supports several types of storage solutions

  • NFS
  • GlusterFS
  • Flocker
  • Ceph
  • ScaleIO
  • AWS EBS
  • Azure Disk
  • Google Persistent Disk

Persistent Volume Claim

Administrators create PVs, and users create PVCs to use them. During the binding process, Kubernetes tries to find a PV that has sufficient capacity for the claim and that matches any other requested properties, such as access modes, volume modes, storage class and selector

Note that a smaller claim may get bound to a larger volume if all the other criteria match and there are no better options

There is a one-to-one relationship between PV and PVC; no other claim can use the remaining capacity in the volume
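
The matching rules above can be sketched as a small simulation. This is a simplified model rather than the Kubernetes controller's code: the Volume and Claim classes, the bind function, and the choice to prefer the smallest fitting volume are assumptions made for illustration, and volume modes and selectors are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Volume:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str = ""
    claimed_by: Optional[str] = None


@dataclass
class Claim:
    name: str
    request_gi: int
    access_modes: frozenset
    storage_class: str = ""


def bind(claim: Claim, volumes: List[Volume]) -> Optional[Volume]:
    """Pick the smallest unclaimed volume that satisfies the claim."""
    candidates = [
        v for v in volumes
        if v.claimed_by is None
        and v.capacity_gi >= claim.request_gi
        and claim.access_modes <= v.access_modes
        and v.storage_class == claim.storage_class
    ]
    if not candidates:
        return None  # the claim stays pending
    best = min(candidates, key=lambda v: v.capacity_gi)
    best.claimed_by = claim.name  # one-to-one: the whole volume is now taken
    return best


volumes = [
    Volume("pv-small", 1, frozenset({"ReadWriteOnce"})),
    Volume("pv-large", 10, frozenset({"ReadWriteOnce"})),
]
first = bind(Claim("claim-a", 5, frozenset({"ReadWriteOnce"})), volumes)
second = bind(Claim("claim-b", 2, frozenset({"ReadWriteOnce"})), volumes)
```

Here claim-a asks for 5Gi and gets the 10Gi volume, and claim-b then stays unbound even though 5Gi of pv-large is unused.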

Configure the field persistentVolumeReclaimPolicy to define what happens to the PV after its PVC is deleted.

  • Retain (default for manually created PVs)
  • Delete
  • Recycle (deprecated)

Networking

Networking for Linux Basics

Network Switch

A switch is a device in a computer network that connects other devices together. It can only enable communication within a single network

Host A(192.168.1.10)[eth0] ↔ Switch(192.168.1.0) ↔ [eth0]Host B(192.168.1.11)

Router

A router is a device/service that provides the function of routing IP packets between networks

Switch(192.168.1.0) ↔ [192.168.1.1]Router[192.168.2.1] ↔ Switch(192.168.2.0)

Route/Gateway

A gateway (in networking terms) is a router seen from the host’s side: the door through which the host reaches another network. A route tells the host which gateway to use for a given destination network

Default Gateway

If none of the forwarding rules in the routing table matches a given destination address, the default gateway is used as the router of last resort
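
A routing decision of this kind can be sketched with Python's standard ipaddress module. The table below reuses the two networks from the diagrams above; the default gateway address 192.168.1.254 and the function name next_hop are assumptions for illustration.

```python
import ipaddress

# Hypothetical routing table: (destination network, gateway).
# A gateway of None means the network is directly connected.
ROUTES = [
    (ipaddress.ip_network("192.168.1.0/24"), None),
    (ipaddress.ip_network("192.168.2.0/24"), "192.168.1.1"),
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.254"),  # default gateway (assumed address)
]


def next_hop(destination: str):
    """Return the gateway for the most specific matching route."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in ROUTES if addr in net]
    if not matches:
        raise LookupError(f"no route to {destination}")
    # Longest prefix wins; 0.0.0.0/0 has prefix length 0, so it is chosen last.
    net, gw = max(matches, key=lambda route: route[0].prefixlen)
    return gw
```

A destination on 192.168.2.0/24 goes through the router, while an unrelated address such as 8.8.8.8 only matches 0.0.0.0/0 and falls through to the default gateway.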

Forwarding packets between interfaces

By default in Linux, packets are not forwarded from one interface to the next, for security reasons

Explicitly allow it by writing 1 to /proc/sys/net/ipv4/ip_forward

Persist the setting by adding net.ipv4.ip_forward = 1 to /etc/sysctl.conf

DNS

Host names can be translated to IP addresses by adding entries to /etc/hosts

When an environment has too many entries and the IP addresses are not persistent, we need a DNS server

The host looks for an entry in /etc/hosts first, then queries DNS. This order can be changed in the file /etc/nsswitch.conf
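
The lookup order can be sketched as follows. The hosts entries, the FAKE_DNS dictionary standing in for a real DNS server, and the function names are all assumptions for illustration; the order tuple mirrors the "files dns" sources on the hosts line of /etc/nsswitch.conf.

```python
HOSTS_TEXT = """
127.0.0.1      localhost
192.168.1.11   db
192.168.1.12   web web.mycompany.com   # aliases share one line
"""

# Stand-in for a DNS server; a real lookup would query the network.
FAKE_DNS = {"web": "192.168.1.99", "nfs": "192.168.1.20"}


def parse_hosts(text):
    """Map every name on each hosts line to that line's IP address."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            ip, names = fields[0], fields[1:]
            for name in names:
                table.setdefault(name, ip)  # first entry wins
    return table


def resolve(name, order=("files", "dns")):
    """Try each source in order and return the first answer found."""
    sources = {"files": parse_hosts(HOSTS_TEXT), "dns": FAKE_DNS}
    for source in order:
        if name in sources[source]:
            return sources[source][name]
    return None
```

With the default order, web resolves from the hosts entries. Passing order=("dns", "files") makes the DNS answer win instead.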

You can configure the DNS server to forward unknown host names to a public name server on the Internet, for example to reach www.google.com

private DNS → Root DNS → .com DNS → google DNS → cache the result

When looking for a host in the same domain, we want to use just the host name rather than the full name, such as web instead of web.mycompany.com. To do that, specify the domain name to append with a search entry in /etc/resolv.conf
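
The effect of a search entry can be sketched like this. The resolv.conf content, including the name server address and the second search domain, is a made-up example, and the sketch ignores resolver options such as ndots.

```python
def search_domains(resolv_conf_text):
    """Return the domains listed on the last search line, if any."""
    domains = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            domains = fields[1:]
    return domains


def candidate_names(name, resolv_conf_text):
    """List the names a resolver would try, in order, for a short host name."""
    if name.endswith("."):
        return [name]  # already fully qualified
    expanded = [f"{name}.{domain}" for domain in search_domains(resolv_conf_text)]
    return expanded + [name]


# Hypothetical /etc/resolv.conf content.
RESOLV_CONF = """
nameserver 192.168.1.100
search mycompany.com prod.mycompany.com
"""
```

For the short name web, the candidates are web.mycompany.com, then web.prod.mycompany.com, then web on its own.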

DNS stores records of specific types:

  • A: host name to IPv4 address
  • AAAA: host name to IPv6 address
  • CNAME: name to name mapping

You can use tools like nslookup and dig to debug. Note that nslookup queries only the DNS server and ignores entries in /etc/hosts

There are plenty of DNS server solutions, such as CoreDNS. Besides reading entries from files, CoreDNS supports other sources of DNS entries through plugins, such as the kubernetes plugin

Network Namespace

A namespace is a way of scoping a particular set of identifiers

Linux provides namespaces for networking and for processes. If a process is running within a process namespace, it can only see and communicate with other processes in the same namespace

Linux starts up with a default network namespace

Each network namespace has its own routing table and has its own set of iptables

Connect namespaces together using a virtual Ethernet pair (or virtual cable, pipe)

When more namespaces need to be connected, use a virtual switch to create a virtual network. There are a few solutions:

  • Linux Bridge
  • Open vSwitch

When a private virtual network needs to reach an outside network, it needs a gateway. Here the host is the gateway

For the destination network to be able to respond, enable NAT on the host acting as the gateway.

Add a new rule to the POSTROUTING chain of the iptables nat table to masquerade, that is, to replace the source address on all packets coming from the source network 192.168.15.0 with the host’s own IP address.

Thus anyone receiving these packets outside the network will think that they are coming from the host and not from within the namespaces
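
The address rewrite that masquerading performs can be sketched as a pure function. This models only the source address change, not iptables itself or its connection tracking; the host address 192.168.1.10 and the /24 prefix length are assumptions.

```python
import ipaddress
from dataclasses import dataclass, replace

SOURCE_NET = ipaddress.ip_network("192.168.15.0/24")  # prefix length assumed
HOST_IP = "192.168.1.10"  # assumed address of the host acting as gateway


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def masquerade(packet: Packet) -> Packet:
    """Rewrite the source address of packets leaving the private network."""
    if ipaddress.ip_address(packet.src) in SOURCE_NET:
        return replace(packet, src=HOST_IP)
    return packet  # traffic from other sources is left untouched
```

A packet from 192.168.15.2 leaves the host carrying the host's address as its source, so replies are sent back to the host, which real NAT then maps back to the namespace.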

Add a default route through the gateway so that the namespace can reach the outside world

For the outside world to reach a namespace in the private network, add a port forwarding rule using iptables so that any traffic arriving at port 80 on the host is forwarded to port 80 on the IP assigned to the namespace
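
Port forwarding is the mirror image of masquerading: the destination is rewritten instead of the source. The sketch below models that rewrite only; the namespace address 192.168.15.2, the host address in the usage note, and the FORWARDS table are assumptions for illustration.

```python
from dataclasses import dataclass, replace

NAMESPACE_IP = "192.168.15.2"        # assumed address assigned to the namespace
FORWARDS = {80: (NAMESPACE_IP, 80)}  # host port -> (target IP, target port)


@dataclass(frozen=True)
class Packet:
    dst: str
    dst_port: int


def port_forward(packet: Packet) -> Packet:
    """Redirect traffic for a forwarded host port to the namespace."""
    if packet.dst_port in FORWARDS:
        ip, port = FORWARDS[packet.dst_port]
        return replace(packet, dst=ip, dst_port=port)
    return packet  # ports without a rule are delivered to the host itself
```

Traffic sent to port 80 on the host, say 192.168.1.10, ends up at 192.168.15.2:80, while traffic to port 22 is not redirected.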

Kubernetes Short Notes(3)

  • Devops

Cluster Maintenance

OS Upgrade

Pod Eviction Timeout

When a node has been down for more than 5 minutes (the default pod eviction timeout), its pods are terminated. Pods that belong to a ReplicaSet are recreated on other nodes

Drain, Cordon, Uncordon

If we are not sure the node will come back online within 5 minutes, we can drain it first. Draining cordons the node (marks it unschedulable) and evicts its pods so that they are recreated on other nodes.

After the drained node has been upgraded and comes back, it is still unschedulable. Uncordon the node to make it schedulable again.

Note that the previous pods won’t be automatically rescheduled back to the node.
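
The three operations can be modeled with a toy scheduler to show why pods do not return after an uncordon. This is a simulation under simplifying assumptions, not kubectl or the Kubernetes API: the Node class and the least-loaded placement rule are made up for illustration, and pods are moved directly rather than evicted and recreated by a controller.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    schedulable: bool = True
    pods: list = field(default_factory=list)


def cordon(node):
    node.schedulable = False  # existing pods keep running


def drain(node, others):
    """Cordon the node, then move its pods to schedulable nodes."""
    cordon(node)
    targets = [n for n in others if n.schedulable]
    if node.pods and not targets:
        raise RuntimeError("no schedulable node left for the evicted pods")
    while node.pods:
        pod = node.pods.pop()
        # Place the pod on the least loaded schedulable node.
        min(targets, key=lambda n: len(n.pods)).pods.append(pod)


def uncordon(node):
    node.schedulable = True  # pods that moved away do not come back


node1 = Node("node-1", pods=["web-a", "web-b"])
node2 = Node("node-2")
drain(node1, [node2])
uncordon(node1)
```

After the uncordon, node-1 is schedulable again but empty. It only receives pods when new ones are scheduled.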

Cluster Upgrade

The core control plane components can run different versions, but they should follow certain rules:

  • kube-apiserver is the primary component; no other component may run a higher version than kube-apiserver
  • the other components can be one or two minor versions lower
    • kube-apiserver: x
    • controller-manager, kube-scheduler: x, x-1
    • kubelet, kube-proxy: x, x-1, x-2
  • kubectl can be one minor version higher or lower than kube-apiserver: x+1, x, x-1
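
These skew rules can be encoded as a small check. The sketch follows the rules exactly as listed in these notes; the function names and the ALLOWED_SKEW table are assumptions for illustration, and newer Kubernetes releases may allow different offsets.

```python
def minor(version: str) -> int:
    """Extract the minor number from a version such as 'v1.19.3'."""
    return int(version.lstrip("v").split(".")[1])


# Allowed offsets from the kube-apiserver minor version (lowest, highest).
ALLOWED_SKEW = {
    "kube-controller-manager": (-1, 0),
    "kube-scheduler": (-1, 0),
    "kubelet": (-2, 0),
    "kube-proxy": (-2, 0),
    "kubectl": (-1, 1),
}


def skew_ok(component: str, version: str, apiserver_version: str) -> bool:
    """Check one component's version against the kube-apiserver version."""
    low, high = ALLOWED_SKEW[component]
    offset = minor(version) - minor(apiserver_version)
    return low <= offset <= high
```

For example, skew_ok("kubelet", "v1.17.0", "v1.19.2") is True because the kubelet is two minor versions behind, while a v1.20 kubelet against a v1.19 API server fails the check.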

Kubernetes supports only the three most recent minor versions. The recommended approach is to upgrade one minor version at a time.

How you upgrade the cluster depends on how it was deployed:

  • cloud provider: a few clicks in the UI
  • kubeadm: use the kubeadm upgrade command (you should upgrade kubeadm itself first!)
  • the hard way from scratch: manually upgrade the components yourself

Two major steps:

  1. upgrade the master node: while the control plane components are down, all management functions are unavailable, and only the applications deployed on worker nodes keep serving
  2. upgrade the worker nodes, using one of these strategies:
    • upgrade all at once, with downtime
    • upgrade one node at a time
    • create new nodes, move the workloads onto them, then remove the old nodes

When you run a command like kubectl get nodes, the VERSION column indicates the version of the kubelet

Backup and Restore

Master / Node DR

  • Cordon & drain
  • Provision replacement master / node

ETCD DR

Option: Backup resources

Save a copy of the objects by querying kube-apiserver

Option: Backup ETCD

Making copies of the ETCD data directory

Or use the etcd command line tool

  1. Take a snapshot

    Remember to specify the certificate files for authentication
  2. Stop kube-apiserver
  3. Restore the snapshot

    When ETCD restores from a backup, it initializes a new cluster configuration and configures the members of ETCD as new members of a new cluster. This prevents a new member from accidentally joining an existing cluster.
    For example, when using a snapshot to provision a new etcd cluster for testing purposes, you don’t want the members of the new test cluster to accidentally join the production cluster.

  4. Configure etcd.service with the new data directory and the new cluster token

    During a restore, you must provide a new cluster token and the same initial cluster configuration

  5. Restart the ETCD service
  6. Start kube-apiserver
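
The snapshot and restore steps above can be sketched as command lines assembled in Python. The etcdctl subcommands and flags are real, but every path, endpoint, and name below is an assumption (the certificate paths follow a typical kubeadm layout), so adjust them to your cluster. The sketch only builds and prints the commands and does not run them. Older etcdctl versions also need ETCDCTL_API=3 in the environment.

```python
import shlex

# Assumed kubeadm-style endpoint and certificate locations.
CERT_ARGS = [
    "--endpoints=https://127.0.0.1:2379",
    "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
    "--cert=/etc/kubernetes/pki/etcd/server.crt",
    "--key=/etc/kubernetes/pki/etcd/server.key",
]


def snapshot_save(path):
    """Step 1: take a snapshot, authenticating with the certificate files."""
    return ["etcdctl", "snapshot", "save", path, *CERT_ARGS]


def snapshot_restore(path, data_dir, token, name, initial_cluster, peer_url):
    """Step 3: restore into a new data directory with a new cluster token."""
    return [
        "etcdctl", "snapshot", "restore", path,
        f"--data-dir={data_dir}",
        f"--initial-cluster-token={token}",
        f"--name={name}",
        f"--initial-cluster={initial_cluster}",
        f"--initial-advertise-peer-urls={peer_url}",
    ]


save_cmd = snapshot_save("/tmp/snapshot.db")
restore_cmd = snapshot_restore(
    "/tmp/snapshot.db",
    data_dir="/var/lib/etcd-from-backup",       # hypothetical new data directory
    token="etcd-cluster-1",                     # hypothetical new cluster token
    name="master",                              # hypothetical member name
    initial_cluster="master=https://127.0.0.1:2380",
    peer_url="https://127.0.0.1:2380",
)
print(shlex.join(save_cmd))
print(shlex.join(restore_cmd))
```

The data directory and cluster token passed to the restore are the same values that step 4 then writes into etcd.service.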

Persistent Volume DR

You can’t rely on Kubernetes for backing up and restoring persistent volumes.

If you’re using cloud provider specific persistent volumes like EBS volumes, Azure managed disks or GCE persistent disks, you should use the cloud provider’s snapshot APIs

Google Ads API Beta(v2.1) Short Notes

Prerequisites

  • Developer token
    • generate from a Google Ads Manager Account
    • you can use the same token for accounts that are not linked to that manager account
    • token state: pending, approved
    • only approved token can connect to the API for production
    • Any developer token (even a pending one) works if you are using a test manager account
  • Customer ID
    • the account number of a Google Ads account
    • can be set or replace programmatically in the SDK
  • OAuth2 client credential
    • generate from a GCP Project/API Console
    • use for managing the API users
  • Client Library

Basic first call (installed application flow)

Step 1

  1. register a Google Ads Manager Account for production use
  2. take the developer token from UI
    > TOOL > SETTINGS > API Center

step refs:

You must use a production (non-test) manager account’s developer token to make API calls against a test account. Even if the token is pending approval, you can still use it to make calls against test accounts.

Step 2

  1. create a Google Account for testing
  2. use the Google Account to register a Google Ads Manager Account
  3. login and create one customer account (the customer created by test manager account will be test only)
  4. create a campaign under the customer account

step refs:

  • https://developers.google.com/google-ads/api/docs/first-call/overview#test_account
  • https://developers.google.com/adwords/api/docs/guides/accounts-overview#test_accounts

Step 3

If you already have an OAuth client:

  1. assume the client ID is ORIGIN.apps.googleusercontent.com
  2. go to the API console, find the OAuth 2.0 client ID, then download the client secret JSON file
  3. use this secret to generate a refresh token

else:

  1. create a GCP project
  2. enable the Google Ads API on the API console page
  3. create an OAuth client, assume the client ID is NEW.apps.googleusercontent.com
  4. download the client secret JSON file
  5. use this secret to generate a refresh token

step refs:

  • https://developers.google.com/adwords/api/docs/guides/accounts-overview#usingoauth2withtestaccounts

When requesting an OAuth2 refresh token, make sure you’re logged in as the test manager account user

To access a test account using OAuth2, the test manager account user must grant permission to your client application. Therefore, when requesting a refresh token, ensure you’re logged in as the test manager account rather than the production manager account.

When you want to switch from the test manager account to the production manager account, simply reconfigure the client library to use the production manager account’s refresh token.

Step 4

  1. install client library
  2. create an sdk config file google-ads.yml and insert the values:
    • developer_token
    • client_id
    • client_secret
    • refresh_token
    • login_customer_id
  3. initialize a client object that loads the config file
  4. make the first call to search the campaign we just created
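
The config file from step 2 can be sketched as a small renderer. It uses only the standard library and does not call the Google Ads client library; the function name render_config and every value below are placeholders, and login_customer_id is the underscore spelling the client libraries use for the last key. The query string is a GAQL example of the kind of campaign search step 4 describes.

```python
CONFIG_KEYS = (
    "developer_token",
    "client_id",
    "client_secret",
    "refresh_token",
    "login_customer_id",
)


def render_config(values: dict) -> str:
    """Render the flat key: value lines of the SDK config file."""
    missing = [key for key in CONFIG_KEYS if not values.get(key)]
    if missing:
        raise ValueError(f"missing config values: {', '.join(missing)}")
    return "".join(f'{key}: "{values[key]}"\n' for key in CONFIG_KEYS)


# Placeholder values only; none of these are real credentials.
example = render_config({
    "developer_token": "INSERT_DEVELOPER_TOKEN",
    "client_id": "NEW.apps.googleusercontent.com",
    "client_secret": "INSERT_CLIENT_SECRET",
    "refresh_token": "INSERT_REFRESH_TOKEN",
    "login_customer_id": "1234567890",
})

# A GAQL query that lists campaigns, as the first call would.
CAMPAIGN_QUERY = "SELECT campaign.id, campaign.name FROM campaign ORDER BY campaign.id"
```

In the Python client library, for instance, GoogleAdsClient.load_from_storage reads a file of this shape, and the query would be passed to the GoogleAdsService search call together with the customer ID.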

No Exit (無處可逃)

  • Quote

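It meant that back when Darby was still in high school, the cancer was already growing inside her mother. When she lied to explain why there was a torn Sears tag in her purse, the cancer was there. When she took bad ecstasy and, with a green glow bracelet on her wrist, drove home groggy at three in the morning while her tearful mother screamed from the porch that she was a slut, the cancer was there. That invisible thing had been perched on her mother’s shoulder all along, eavesdropping on them. Her mother had been slowly dying the whole time, and neither of them had any idea.

Sandi seemed to enjoy looking after him as if he were a child, as though that gave her a reason to hate him.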