Datasets are top-level containers used to organize and control access to your tables and views.
location
BigQuery processes queries in the same location as the dataset.
Location cannot be changed after creation.
Use the BigQuery Data Transfer Service or Cloud Composer to transfer data across locations.
There are two types of locations:
A region is a specific geographic place, such as us-central1.
A multi-region is a large geographic area, such as EU or US
considerations
Colocate with external data source
If your dataset is in the US multi-regional location, the GCS bucket must be a multi-regional bucket in the US.
If your external data source is Cloud Bigtable, your dataset must be in the US or EU multi-regional location.
Colocate with GCS buckets for loading and exporting data
The GCS bucket must be a regional or multi-regional bucket in the same location as the dataset. Exception: if the dataset is in the US multi-regional location, you can load data from a GCS bucket in any regional or multi-regional location.
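The loading rule above can be sketched as a small check (a simplified illustration, not an official API; BigQuery validates locations server-side):

```python
# Simplified sketch of the GCS -> BigQuery colocation rule above.
# Illustrative only; BigQuery enforces the real rules itself.
def can_load_from_gcs(dataset_location: str, bucket_location: str) -> bool:
    """True if a load job from the bucket into the dataset is allowed."""
    # Exception: a dataset in the US multi-region can load from any location.
    if dataset_location == "US":
        return True
    # Otherwise the bucket must be in the same location as the dataset.
    return dataset_location == bucket_location
```

For example, a dataset in `EU` can only load from an `EU` bucket, while a dataset in `US` can load from anywhere.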
availability and durability
Failure domains
Machine-level
Zonal
Regional
Failure types
Soft: a power failure, network partition, or machine crash; should never cause data loss.
Hard: physical damage from floods, terrorist attacks, earthquakes, or hurricanes; data might be lost.
Single region
No backup or replication to another region.
Consider creating cross-region backups.
Multi region
Data is stored in a single region but backed up in a geographically separated region to provide resilience against a regional disaster.
Use MapReduce to find links and generate backlinks
Calculate page rank to build the Google index
Later, it was also used as the backend for Gmail, Google Analytics, Google Earth, etc.
Hadoop HDFS
Inspired by Google's distributed file system, GFS
Timeline
Since 2009, many companies (Yahoo, Facebook, eBay, etc.) chose HBase for large-scale production use cases
In 2015, Google announced Cloud Bigtable with HBase 1.0-compatible API support for its Compute Engine users
2017, HBase 2.0.0
2020, HBase 3.0.0
Despite being bucketed into the NoSQL category of data storage, some interesting projects are moving NoSQL back toward SQL by using HBase as the storage engine for SQL-compliant OLTP database systems.
Use case
HBase’s strengths are its ability to scale and sustain high write throughputs
Many HBase apps are:
Ports from RDBMS to HBase
New low-latency big data apps
How do you port an RDBMS to HBase?
Many RDBMS are painful to scale
Scaling up is no longer practical for massive data
Data inconsistency was not acceptable when scaling reads
Operationally gets more complicated as the number of replicas increases
Operational techniques not sufficient when scaling writes
To make it easier to scale, we need to discard the fundamental features that RDBMS provides, such as:
text search (LIKE)
joins
foreign keys and constraint checks
Change the schema to contain only denormalized tables; then sharding the RDBMS won't incur replication IO
After these changes, porting an RDBMS to HBase is relatively straightforward
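As a toy illustration of the denormalization step above (the table and column names here are hypothetical, not from the notes):

```python
# Toy illustration of denormalizing relational data before a port to HBase.
# Normalized form: two "tables" joined by user_id (names are made up).
users = {1: {"name": "Ada"}}
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
]

def denormalize(users, orders):
    """Build one wide row per user with orders embedded: no join needed."""
    rows = {uid: {**u, "orders": []} for uid, u in users.items()}
    for o in orders:
        rows[o["user_id"]]["orders"].append(o["item"])
    return rows

wide = denormalize(users, orders)
# wide[1] now holds the user's name and all of their orders in one row,
# which maps naturally onto a single HBase row.
```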
Why choose HBase instead?
When your apps need high write and read throughput
When you're tired of the RDBMS's fragile scaling operations
Data Volumes
Entity data: information about the current state of a particular person or thing
Event data (or time-series data): records events that are generally spaced over many time intervals
Data volume explodes when we need both of them
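One common way to keep both kinds of data in HBase is to encode them into the rowkey; the scheme below is a hypothetical sketch, not something prescribed by the notes:

```python
# Hypothetical rowkey scheme combining entity and event data.
# Entity rows: "user#<id>"; event rows: "user#<id>#<reversed_ts>".
# Reversing the timestamp makes the newest events sort first, because
# HBase stores rows in lexicographic rowkey order.
MAX_TS = 2**63 - 1  # assumed upper bound for timestamps

def entity_key(user_id: str) -> str:
    return f"user#{user_id}"

def event_key(user_id: str, ts: int) -> str:
    # Zero-pad so string comparison matches numeric order.
    return f"user#{user_id}#{MAX_TS - ts:019d}"
```

With this layout, a scan starting at `entity_key("42")` returns the entity row followed by that user's events, newest first.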
HBase or Not
Q: Does your app expect new data to be available immediately after an update?
Yes: Use HBase
Queried data must reflect the most recent values
Expect query responses in milliseconds
No: No need for HBase
Q: Is your app analytical or operational?
Analytical: Not optimal for HBase
Queries look at large sets of data
Often filtered to a particular time range
Hadoop is a better choice
Operational: Use HBase
Queries look up a single entity or a small set of entities
Q: Does your app need updates to be visible immediately after they are made?
Yes: Use HBase
Frequently modified
Pinpoint deletes
Updates must be reflected within milliseconds
No: No need for HBase
Data is append-only
Deletes in bulk or never
Updates can be ignored until the next report is run
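The checklist above can be condensed into a small decision helper (a sketch; the function and parameter names are mine, not a standard API):

```python
def hbase_is_a_fit(workload: str,
                   needs_fresh_reads: bool,
                   needs_pinpoint_updates: bool) -> bool:
    """Condense the Q&A above: HBase fits operational apps that need
    millisecond-fresh reads and frequent pinpoint updates/deletes."""
    if workload == "analytical":
        # Large scans over time ranges: Hadoop is a better choice.
        return False
    return needs_fresh_reads or needs_pinpoint_updates
```

For example, an operational app that needs fresh reads is a fit, while an append-only analytical pipeline is not.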
comparison (HBase vs Hadoop, by workload)
Low latency: HBase 1 ms from cache, 10 ms from disk; Hadoop 1 min via MR/Spark, 1 s via Impala
Random read: HBase rowkey is the primary index; Hadoop has the small-file problem
Short scan: HBase sorted and efficient; Hadoop bespoke partitioning can help
Full scan: HBase possible but non-optimal, improved perf with MR on snapshots
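Why short scans are efficient in HBase: rows are kept sorted by rowkey, so a prefix scan only touches a contiguous slice. A minimal sketch, with a sorted in-memory list standing in for an HBase region:

```python
import bisect

# Sorted rowkeys, as an HBase region stores them (rowkey is the primary index).
rowkeys = sorted(["user#1#a", "user#1#b", "user#2#a", "user#3#a"])

def short_scan(prefix: str):
    """Return all rowkeys with the given prefix via binary search:
    only the contiguous matching slice is touched, not the whole set."""
    lo = bisect.bisect_left(rowkeys, prefix)
    hi = bisect.bisect_left(rowkeys, prefix + "\xff")
    return rowkeys[lo:hi]

# short_scan("user#1#") returns ["user#1#a", "user#1#b"]
```

A full scan, by contrast, has to walk every row, which is why the table above calls it possible but non-optimal.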