HBase Basics
Apache HBase is an open-source, scalable, consistent, low-latency, random-access data store
Source: Infinite Skills
Features
Horizontally Scalable
A linear increase in servers results in a linear increase in storage capacity and I/O operations
CAP Trade off
In terms of the CAP theorem, HBase is closer to a CP system
- Consistency: ACID (atomicity, consistency, isolation, durability) guarantees on rows
- Availability: Response times of 2-3 ms from cache, 10-20 ms from disk
- Partition Tolerance: Failures don't block the system, but it may take longer to respond in order to maintain consistency
Dependencies
Apache ZooKeeper
- Used for distributed coordination and leader election for high availability
- Optimized to be highly available for reads
- Not designed to scale for high write throughput
Apache Hadoop HDFS
- Provides data durability and reliability
- Optimized for sequential reads and writes of large files
- Does not support random updates; provides only a simple API for random reads
- Cannot scale to tens of billions of small entities (smaller than a few hundred MB)
Both systems have their strengths, but neither individually provides the same properties as HBase
Random Access
Optimized for small random reads
- Entities indexed for efficient random reads
Optimized for high throughput random writes
- Updates without requiring read
- Random writes via Log Structured Merge (LSM)
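The LSM idea behind those properties: writes land in a sorted in-memory buffer (a memtable) and are periodically flushed to immutable sorted files, so an update never has to read or rewrite existing data; reads consult the newest data first. A minimal sketch (toy class and names, not the actual HBase implementation):

```python
import bisect

class MiniLSM:
    """Toy log-structured merge store: sorted in-memory buffer,
    flushed to immutable sorted runs; reads check newest data first."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}        # recent writes, key -> value
        self.runs = []            # flushed, immutable sorted runs (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Updates never read existing data -- just record the new version.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the buffer as an immutable sorted run (like an HFile).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then runs newest-to-oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

store = MiniLSM()
store.put("row1", "v1")
store.put("row2", "v2")   # buffer full: triggers a flush
store.put("row1", "v3")   # newer version shadows the flushed one
```

Note that `put` is blind: the new version of `row1` is simply recorded in memory, and the older flushed value is ignored at read time rather than overwritten, which is why LSM stores sustain high random-write throughput.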
Short History
Inspired by Google's Bigtable
Bigtable: A Distributed Storage System for Structured Data (2006)
BigTable
Datastore for Google’s Web Crawl Table
- Store web page content
- Web URL as key
- Use MapReduce to find links and generate backlinks
- Calculate page rank to build the Google index
Later, it was also used as the backend for Gmail, Google Analytics, Google Earth, etc.
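A detail worth noting from the Bigtable design: crawl-table row keys stored the URL with its hostname reversed, so all pages from one domain sort next to each other in the key-ordered table. A small sketch of that trick (illustrative helper, not Google's actual code):

```python
def crawl_rowkey(url):
    """Reverse the hostname so pages from the same domain sort adjacently,
    e.g. maps.google.com/index.html -> com.google.maps/index.html."""
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

# In a key-sorted store, all *.google.com pages cluster under "com.google."
keys = sorted(crawl_rowkey(u) for u in [
    "maps.google.com/index.html",
    "news.example.org/a",
    "mail.google.com/inbox",
])
```

The same style of key design carries over directly to HBase, where the rowkey is the only primary index.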
Hadoop HDFS
Inspired by Google's distributed file system, GFS
Timeline
Since 2009, many companies (Yahoo, Facebook, eBay, etc.) have chosen HBase for large-scale production use cases
In 2015, Google announced Cloud Bigtable with an HBase 1.0-compatible API for its cloud users
2017, HBase 2.0.0
2020, HBase 3.0.0
Despite being bucketed into the NoSQL category of data storage, some interesting projects are moving NoSQL back toward SQL by using HBase as the storage engine for SQL-compliant OLTP database systems (for example, Apache Phoenix)
Use case
HBase’s strengths are its ability to scale and sustain high write throughputs
Many HBase apps are:
- Ports from RDBMS to HBase
- New low-latency big data apps
Porting an RDBMS to HBase
- Many RDBMSs are painful to scale
- Scaling up is no longer practical for massive data
- Data inconsistency was not acceptable when scaling reads
- Operationally gets more complicated as the number of replicas increases
- Operational techniques not sufficient when scaling writes
To make an RDBMS easier to scale, we typically discard fundamental features it provides, such as:
- text search (LIKE)
- joins
- foreign keys and constraint checks
By changing the schema so it contains only denormalized tables, we avoid cross-shard replication I/O when sharding the RDBMS
At that point, porting the RDBMS to HBase is relatively straightforward
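Denormalization here means folding what were joined tables into one wide row per entity. A minimal sketch with hypothetical `users`/`orders` data, collapsing a join into HBase-style `family:qualifier` columns keyed by user:

```python
# Normalized RDBMS-style tables (hypothetical example data).
users = {1: {"name": "Ada"}}
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 40.0},
]

# Denormalized, HBase-style wide rows: one row per user, with that user's
# orders folded in as extra columns, so no join is needed at read time.
wide_rows = {}
for uid, user in users.items():
    row = {"info:name": user["name"]}
    for o in orders:
        if o["user_id"] == uid:
            # Column qualifier encodes the order id; value is the order total.
            row[f"order:{o['order_id']}"] = o["total"]
    wide_rows[f"user#{uid}"] = row
```

Because everything about a user now lives in a single row, a single rowkey lookup replaces the join, and the row can move between shards (regions) without cross-shard traffic.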
Why choose HBase instead?
- When your app needs high write and read throughput
- When you're tired of the RDBMS's fragile scaling operations
Data Volumes
- Entity data: information about the current state of a particular person or thing
- Event data (or time-series data): records of events, generally spaced over many time intervals
Data volume explodes when we need both of them
HBase or Not
Q: Does your app expect new data to be available immediately after an update?
- Yes: Use HBase
- When data queried, must reflect the most recent values
- Expect query responses in milliseconds
- No: No need for HBase
Q: Is your app analytical or operational?
- Analytical: Not optimal for HBase
	- Looks at large sets of data
	- Often filters for a particular time range
	- Hadoop is the better choice
- Operational: Use HBase
- Look for single or small set of entities
Q: Does your app expect updates to be available immediately after an update?
- Yes: Use HBase
- Frequently modified
- Pinpoint deletes
- Updates must be reflected within milliseconds
- No: No need for HBase
- Data is append-only
- Deletes in bulk or never
- Updates can be ignored until the next report is run
Comparison

| Workload | HBase | Hadoop |
|---|---|---|
| Low latency | 1 ms from cache, 10 ms from disk | 1 min via MR/Spark, 1 s via Impala |
| Random read | Rowkey is the primary index | The small-file problem |
| Short scan | Sorted and efficient | Bespoke partitioning can help |
| Full scan | Possible but not optimal; improved perf with MR on snapshots | Optimized with MR, Hive, Impala |
| Updates | Optimized | Not supported |
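The "short scan" row deserves a sketch: because rowkeys are stored sorted, a short scan is just a seek to the first matching key plus a forward read until the prefix stops matching. A toy model of one region's sorted keys (hypothetical sensor/timestamp keys, not the HBase API):

```python
import bisect

# Toy model of a region's sorted rowkeys: entity id + timestamp,
# so one entity's events are contiguous in key order.
rowkeys = sorted([
    "sensor1#2024-01-01", "sensor1#2024-01-02", "sensor1#2024-01-03",
    "sensor2#2024-01-01", "sensor2#2024-01-02",
])

def short_scan(prefix):
    """Binary-search to the first key >= prefix, then read forward
    until the prefix no longer matches -- O(log n + matches)."""
    start = bisect.bisect_left(rowkeys, prefix)
    out = []
    for k in rowkeys[start:]:
        if not k.startswith(prefix):
            break
        out.append(k)
    return out
```

This is why the table calls short scans "sorted and efficient": the cost is proportional to the rows returned, not to the table size, whereas in HDFS the same query would have to read whole files unless partitioning happens to match the query.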