Datasets are top-level containers used to organize and control access to your tables and views.
location
BigQuery processes queries in the same location as the dataset.
Location cannot be changed after creation.
Use the BigQuery Data Transfer Service or Cloud Composer to transfer data across locations.
There are two types of locations:
A region is a specific geographic place, such as us-central1.
A multi-region is a large geographic area, such as EU or US
considerations
Colocate with external data source
If your dataset is in the US multi-regional location, the GCS bucket must be a multi-regional bucket in the US.
If your external data source is Cloud Bigtable, your dataset must be in the US or EU multi-regional location.
Colocate with GCS buckets for loading and exporting data
The GCS bucket must be a regional or multi-regional bucket in the same location as the dataset. Exception: if the dataset is in the US multi-regional location, you can load data from a GCS bucket in any regional or multi-regional location.
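The loading rule above can be sketched as a small check (a simplified illustration, not an official API; BigQuery validates locations server-side):

```python
# Simplified sketch of the GCS -> BigQuery colocation rule above.
# Illustrative only; BigQuery enforces the real rules itself.
def can_load_from_gcs(dataset_location: str, bucket_location: str) -> bool:
    """True if a load job from the bucket into the dataset is allowed."""
    # Exception: a dataset in the US multi-region can load from any location.
    if dataset_location == "US":
        return True
    # Otherwise the bucket must be in the same location as the dataset.
    return dataset_location == bucket_location
```

For example, a dataset in `EU` can only load from an `EU` bucket, while a dataset in `US` can load from anywhere.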
availability and durability
Failure domains
Machine-level
Zonal
Regional
Failure types
Soft: a power failure, network partition, or machine crash; should never cause data loss.
Hard: physical damage from floods, terrorist attacks, earthquakes, or hurricanes; data might be lost.
Single region
No backup or replication to another region.
Consider creating cross-region backups.
Multi region
Data is stored in a single region but backed up in a geographically separated region to provide resilience against a regional disaster.
Use MapReduce to find links and generate backlinks
Calculate page rank to build the Google index
Later, it was also used as the backend for Gmail, Google Analytics, Google Earth, etc.
Hadoop HDFS
Inspired by Google's distributed file system, GFS
Timeline
Since 2009, many companies (Yahoo, Facebook, eBay, etc.) chose HBase for large-scale production use cases
In 2015, Google announced Cloud Bigtable with HBase 1.0-compatible API support for its Compute Engine users
2017, HBase 2.0.0
2020, HBase 3.0.0
Despite being bucketed into the NoSQL category of data storage, some interesting projects are moving NoSQL back toward SQL by using HBase as the storage engine for SQL-compliant OLTP database systems.
Use case
HBase’s strengths are its ability to scale and sustain high write throughputs
Many HBase apps are:
Ports from RDBMS to HBase
New low-latency big data apps
How do you port an RDBMS to HBase?
Many RDBMS are painful to scale
Scaling up is no longer practical for massive data
Data inconsistency was not acceptable when scaling reads
Operationally gets more complicated as the number of replicas increases
Operational techniques not sufficient when scaling writes
To make it easier to scale, we need to discard the fundamental features that RDBMS provides, such as:
text search (LIKE)
joins
foreign keys and constraint checks
Change the schema to contain only denormalized tables; then sharding the RDBMS won't incur replication IO
After these changes, porting an RDBMS to HBase is relatively straightforward
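As a toy illustration of the denormalization step above (the table and column names here are hypothetical, not from the notes):

```python
# Toy illustration of denormalizing relational data before a port to HBase.
# Normalized form: two "tables" joined by user_id (names are made up).
users = {1: {"name": "Ada"}}
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
]

def denormalize(users, orders):
    """Build one wide row per user with orders embedded: no join needed."""
    rows = {uid: {**u, "orders": []} for uid, u in users.items()}
    for o in orders:
        rows[o["user_id"]]["orders"].append(o["item"])
    return rows

wide = denormalize(users, orders)
# wide[1] now holds the user's name and all of their orders in one row,
# which maps naturally onto a single HBase row.
```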
Why choose HBase instead?
When your apps need high write and read throughput
When you're tired of the RDBMS's fragile scaling operations
Data Volumes
Entity data: information about the current state of a particular person or thing
Event data (or time-series data): records events that are generally spaced over many time intervals
Data volume explodes when we need both of them
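One common way to keep both kinds of data in HBase is to encode them into the rowkey; the scheme below is a hypothetical sketch, not something prescribed by the notes:

```python
# Hypothetical rowkey scheme combining entity and event data.
# Entity rows: "user#<id>"; event rows: "user#<id>#<reversed_ts>".
# Reversing the timestamp makes the newest events sort first, because
# HBase stores rows in lexicographic rowkey order.
MAX_TS = 2**63 - 1  # assumed upper bound for timestamps

def entity_key(user_id: str) -> str:
    return f"user#{user_id}"

def event_key(user_id: str, ts: int) -> str:
    # Zero-pad so string comparison matches numeric order.
    return f"user#{user_id}#{MAX_TS - ts:019d}"
```

With this layout, a scan starting at `entity_key("42")` returns the entity row followed by that user's events, newest first.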
HBase or Not
Q: Does your app expect new data to be available immediately after an update?
Yes: Use HBase
Queried data must reflect the most recent values
Expect query responses in milliseconds
No: No need for HBase
Q: Is your app analytical or operational?
Analytical: Not optimal for HBase
Queries look at large sets of data
Often filtered to a particular time range
Hadoop is a better choice
Operational: Use HBase
Queries look up a single entity or a small set of entities
Q: Does your app need updates to be visible immediately after they are made?
Yes: Use HBase
Frequently modified
Pinpoint deletes
Updates must be reflected within milliseconds
No: No need for HBase
Data is append-only
Deletes in bulk or never
Updates can be ignored until the next report is run
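The checklist above can be condensed into a small decision helper (a sketch; the function and parameter names are mine, not a standard API):

```python
def hbase_is_a_fit(workload: str,
                   needs_fresh_reads: bool,
                   needs_pinpoint_updates: bool) -> bool:
    """Condense the Q&A above: HBase fits operational apps that need
    millisecond-fresh reads and frequent pinpoint updates/deletes."""
    if workload == "analytical":
        # Large scans over time ranges: Hadoop is a better choice.
        return False
    return needs_fresh_reads or needs_pinpoint_updates
```

For example, an operational app that needs fresh reads is a fit, while an append-only analytical pipeline is not.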
comparison (HBase vs Hadoop, by workload)
Low latency: HBase 1 ms from cache, 10 ms from disk; Hadoop 1 min via MR/Spark, 1 s via Impala
Random read: HBase rowkey is the primary index; Hadoop has the small-file problem
Short scan: HBase sorted and efficient; Hadoop bespoke partitioning can help
Full scan: HBase possible but non-optimal, improved perf with MR on snapshots
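Why short scans are efficient in HBase: rows are kept sorted by rowkey, so a prefix scan only touches a contiguous slice. A minimal sketch, with a sorted in-memory list standing in for an HBase region:

```python
import bisect

# Sorted rowkeys, as an HBase region stores them (rowkey is the primary index).
rowkeys = sorted(["user#1#a", "user#1#b", "user#2#a", "user#3#a"])

def short_scan(prefix: str):
    """Return all rowkeys with the given prefix via binary search:
    only the contiguous matching slice is touched, not the whole set."""
    lo = bisect.bisect_left(rowkeys, prefix)
    hi = bisect.bisect_left(rowkeys, prefix + "\xff")
    return rowkeys[lo:hi]

# short_scan("user#1#") returns ["user#1#a", "user#1#b"]
```

A full scan, by contrast, has to walk every row, which is why the table above calls it possible but non-optimal.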