Dataset
Datasets are top-level containers that are used to organize and control access to your tables and views.
location
BigQuery processes queries in the same location.
Location cannot be changed after creation.
Use BigQuery Data Transfer Service or Cloud Composer to transfer data accross different locations.
There are two types of locations:
- A region is a specific geographic place, such as
us-central1
. - A multi-region is a large geographic area, such as
EU
,US
considerations
- Colocate with external data source
- If dataset is in the US multi-regional location, GCS bucket must be in a multi-regional bucket in the US.
- If dataset is in Cloud Bigtable, your dataset must be in the US or the EU multi-regional location.
- Colocate with GCS buckets for loading and exporting data
- The GCS bucket must be in a regional or multi-regional bucket in the same location. Except US multi-regional location, you can load data from a GCS bucket in any regional or multi-regional location.
availability and durability
Failure domains
- Machine-level
- Zonal
- Regional
Failure types
- Soft: power failure, network partition, or a machine crash, should never lose data.
- Hard: damage from floods, terrorist attacks, earthquakes, and hurricanes, data might be lost.
Single region
- No backup or replication to another region.
- Better considering create cross-region backups.
Multi region
- Data is stored in a single region but is backed up in a geographically-separated region to provide resilience to a regional disaster.
Table
A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
You can give a user access to specific tables or views without giving the user access to the complete dataset by ACL policy.
Table types
- Native
- External(federated): Bigtable, GCS, Cloud SQL, Drive
- Views
schema
A schema cannot contain more than 15 levels of nested RECORD types.
BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated columns
BigQuery only support these modify operations:
- Adding columns
- Deleting columns
- Relaxing a column’s mode from REQUIRED to NULLABLE
partitioning
A table that is divided into segments, called partitions.
Improve query performance.
Reducing the number of bytes read by a query. Lower cost.
BigQuery can more accurately estimate the bytes processed by a query before you run it. (take advantages of maintained metadata for partition)
Partitioned data is stored in physical blocks.
A table can be patritioned by:
- Time-unit column: daily, hourly, monthly, or yearly
- Ingestion time
- Integer range
clustering
Improve the performance of certain types of queries such as queries that use filters clauses and queries that aggregate data.
The order of columns for clustering is important.
BigQuery might not be able to accurately estimate the bytes processed by the query or the query costs.
You can also combine partitioning with clustering. Data is first partitioned and then data in each partition is clustered by the clustering columns.
Sorts the data based on clustering columns and organizes them in optimally sized storage blocks.
Overlapping keys in one cluster block weaken the sort property of the table. BigQuery performs automatic re-clustering in the background to restore the sort property of the table.
View
A view is a virtual table defined by a SQL query.
BigQuery’s standard views are logical views, not materialized views.
materialized views
Precompute the standard views and periodically cache the results for increased performance and efficiency.
Each base table can be referenced by up to 20 materialized views from the same dataset.
A materialized view can reference only a single table, and cannot use joins.
After a materialized view is created on top of a partitioned base table, the table’s partition expiration cannot be changed.
When a base table is queried, the query optimizer might automatically rewrite the query to use the cached result that is stored in the materialized view.
If the base table is partitioned, then you can partition a materialized view on the same partitioning column.
A materialized view might not be up-to-dated:
- Refresh not complete yet
- Automatic refresh is turned off
- Base table is updated too frequently, refresh been limited
Queries on a materialized view will be always up-to-dated:
- Query result is combined with delta changes from the base table
Use cases:
- Store grouped results (base on columns which often used as filter)
- Limit the date range in the materialized view (e.g. full data in 2019)
External Data Sources
Connection is limited by external data source locations.
Cloud SQL
Supports both MySQL and PostgreSQL instances.
Only supports Cloud SQL instances with public IP connectivity.
Use a Federated query syntax which is specified a connection string and a SQL query string.
BigQuery
You can query an external data source by using a permanent table or a temporary table.
permanent
Table is created in a dataset and is linked to the external data source.
Can be shared and re-used.
temporary
Submit a command that includes a query and creates a non-permanent table linked to the external data source.
Is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.