BigQuery Short Notes

  • Devops, Misc

Dataset

Datasets are top-level containers that are used to organize and control access to your tables and views.

location

BigQuery processes queries in the same location.

Location cannot be changed after creation.

Use BigQuery Data Transfer Service or Cloud Composer to transfer data accross different locations.

There are two types of locations:

  • A region is a specific geographic place, such as us-central1.
  • A multi-region is a large geographic area, such as EU, US

considerations

  • Colocate with external data source
    • If dataset is in the US multi-regional location, GCS bucket must be in a multi-regional bucket in the US.
    • If dataset is in Cloud Bigtable, your dataset must be in the US or the EU multi-regional location.
  • Colocate with GCS buckets for loading and exporting data
    • The GCS bucket must be in a regional or multi-regional bucket in the same location. Except US multi-regional location, you can load data from a GCS bucket in any regional or multi-regional location.

availability and durability

Failure domains

  • Machine-level
  • Zonal
  • Regional

Failure types

  • Soft: power failure, network partition, or a machine crash, should never lose data.
  • Hard: damage from floods, terrorist attacks, earthquakes, and hurricanes, data might be lost.

Single region

  • No backup or replication to another region.
  • Better considering create cross-region backups.

Multi region

  • Data is stored in a single region but is backed up in a geographically-separated region to provide resilience to a regional disaster.

Table

A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).

You can give a user access to specific tables or views without giving the user access to the complete dataset by ACL policy.

Table types

  • Native
  • External(federated): Bigtable, GCS, Cloud SQL, Drive
  • Views

schema

A schema cannot contain more than 15 levels of nested RECORD types.

BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated columns

BigQuery only support these modify operations:

  • Adding columns
  • Deleting columns
  • Relaxing a column’s mode from REQUIRED to NULLABLE

partitioning

A table that is divided into segments, called partitions.

Improve query performance.

Reducing the number of bytes read by a query. Lower cost.

BigQuery can more accurately estimate the bytes processed by a query before you run it. (take advantages of maintained metadata for partition)

Partitioned data is stored in physical blocks.

A table can be patritioned by:

  • Time-unit column: daily, hourly, monthly, or yearly
  • Ingestion time
  • Integer range

clustering

Improve the performance of certain types of queries such as queries that use filters clauses and queries that aggregate data.

The order of columns for clustering is important.

BigQuery might not be able to accurately estimate the bytes processed by the query or the query costs.

You can also combine partitioning with clustering. Data is first partitioned and then data in each partition is clustered by the clustering columns.

Sorts the data based on clustering columns and organizes them in optimally sized storage blocks.

Overlapping keys in one cluster block weaken the sort property of the table. BigQuery performs automatic re-clustering in the background to restore the sort property of the table.

View

A view is a virtual table defined by a SQL query.

BigQuery’s standard views are logical views, not materialized views.

materialized views

Precompute the standard views and periodically cache the results for increased performance and efficiency.

Each base table can be referenced by up to 20 materialized views from the same dataset.

A materialized view can reference only a single table, and cannot use joins.

After a materialized view is created on top of a partitioned base table, the table’s partition expiration cannot be changed.

When a base table is queried, the query optimizer might automatically rewrite the query to use the cached result that is stored in the materialized view.

If the base table is partitioned, then you can partition a materialized view on the same partitioning column.

A materialized view might not be up-to-dated:

  • Refresh not complete yet
  • Automatic refresh is turned off
  • Base table is updated too frequently, refresh been limited

Queries on a materialized view will be always up-to-dated:

  • Query result is combined with delta changes from the base table

Use cases:

  • Store grouped results (base on columns which often used as filter)
  • Limit the date range in the materialized view (e.g. full data in 2019)

External Data Sources

Connection is limited by external data source locations.

Cloud SQL

Supports both MySQL and PostgreSQL instances.

Only supports Cloud SQL instances with public IP connectivity.

Use a Federated query syntax which is specified a connection string and a SQL query string.

BigQuery

You can query an external data source by using a permanent table or a temporary table.

permanent

Table is created in a dataset and is linked to the external data source.

Can be shared and re-used.

temporary

Submit a command that includes a query and creates a non-permanent table linked to the external data source.

Is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

發佈留言

發佈留言必須填寫的電子郵件地址不會公開。