`airflow.contrib.hooks.bigquery_hook`¶

This module contains a BigQuery Hook, as well as a very basic PEP 249 implementation for BigQuery.

Module Contents¶

class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True, location=None)[source]¶

Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.hooks.dbapi_hook.DbApiHook

Interact with BigQuery. This hook uses the Google Cloud Platform connection.

conn_name_attr = bigquery_conn_id[source]¶

get_conn(self)[source]¶: Returns a BigQuery PEP 249 connection object.

get_service(self)[source]¶: Returns a BigQuery service object.

insert_rows(self, table, rows, target_fields=None, commit_every=1000)[source]¶: Insertion is currently unsupported. Theoretically, you could use BigQuery’s streaming API to insert rows into a table, but this hasn’t been implemented.

get_pandas_df(self, sql, parameters=None, dialect=None)[source]¶

Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn’t support PEP 249 connections, except for SQLite. See:

https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 https://github.com/pydata/pandas/issues/6900

Parameters

sql (str) – The BigQuery SQL to execute.
parameters (mapping or iterable) – The parameters to render the SQL query with (not used, leave to override superclass method)
dialect (str in {'legacy', 'standard'}) – Dialect of BigQuery SQL – legacy SQL or standard SQL defaults to use self.use_legacy_sql if not specified

table_exists(self, project_id, dataset_id, table_id)[source]¶

Checks for the existence of a table in Google BigQuery.

Parameters

project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
dataset_id (str) – The name of the dataset in which to look for the table.
table_id (str) – The name of the table to check the existence of.

class airflow.contrib.hooks.bigquery_hook.BigQueryPandasConnector(project_id, service, reauth=False, verbose=False, dialect='legacy')[source]¶

Bases: pandas_gbq.gbq.GbqConnector

This connector behaves identically to GbqConnector (from Pandas), except that it allows the service to be injected, and disables a call to self.get_credentials(). This allows Airflow to use BigQuery with Pandas without forcing a three legged OAuth connection. Instead, we can inject service account credentials into the binding.

class airflow.contrib.hooks.bigquery_hook.BigQueryConnection(*args, **kwargs)[source]¶

Bases: object

BigQuery does not have a notion of a persistent connection. Thus, these objects are small stateless factories for cursors, which do all the real work.

close(self)[source]¶: BigQueryConnection does not have anything to close.

commit(self)[source]¶: BigQueryConnection does not support transactions.

cursor(self)[source]¶: Return a new Cursor object using the connection.

rollback(self)[source]¶

class airflow.contrib.hooks.bigquery_hook.BigQueryBaseCursor(service, project_id, use_legacy_sql=True, api_resource_configs=None, location=None, num_retries=5)[source]¶

Bases: airflow.utils.log.logging_mixin.LoggingMixin

The BigQuery base cursor contains helper methods to execute queries against BigQuery. The methods can be used directly by operators, in cases where a PEP 249 cursor isn’t needed.

create_empty_table(self, project_id, dataset_id, table_id, schema_fields=None, time_partitioning=None, cluster_fields=None, labels=None, view=None, encryption_configuration=None, num_retries=5)[source]¶

Creates a new, empty table in the dataset. To create a view, which is defined by a SQL query, parse a dictionary to ‘view’ kwarg

Parameters

project_id (str) – The project to create the table into.
dataset_id (str) – The dataset to create the table into.
table_id (str) – The Name of the table to be created.
schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
labels (dict) – a dictionary containing labels for the table, passed to BigQuery

Example:

schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]

Parameters

time_partitioning (dict) –
configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.

See also

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#timePartitioning
cluster_fields (list) – [Optional] The fields used for clustering. Must be specified with time_partitioning, data in the table will be first partitioned and subsequently clustered. https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#clustering.fields
view (dict) – [Optional] A dictionary containing definition for the view. If set, it will create a view instead of a table: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#view

Example:

view = {
    "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 1000",
    "useLegacySql": False
}

Parameters

encryption_configuration (dict) –

[Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

encryption_configuration = {
    "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
}

Returns

None

create_external_table(self, external_project_dataset_table, schema_fields, source_uris, source_format='CSV', autodetect=False, compression='NONE', ignore_unknown_values=False, max_bad_records=0, skip_leading_rows=0, field_delimiter=', ', quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, encoding='UTF-8', src_fmt_configs=None, labels=None, encryption_configuration=None)[source]¶

Creates a new external table in the dataset with the data in Google Cloud Storage. See here:

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource

for more details about these parameters.

Parameters

external_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table>($<partition>) BigQuery table name to create external table. If <project> is not included, project will be the project defined in the connection json.
schema_fields (list) – The schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource
source_uris (list) – The source Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). A single wild per-object name can be used.
source_format (str) – File format to export.
autodetect (bool) – Try to detect schema and format options automatically. Any option specified explicitly will be honored.
compression (str) – [Optional] The compression type of the data source. Possible values include GZIP and NONE. The default value is NONE. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.
max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.
skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
field_delimiter (str) – The delimiter to use when loading from a CSV.
quote_character (str) – The value that is used to quote data sections in a CSV file.
allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).
allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable when soure_format is CSV.
encoding (str) –
The character encoding of the data. See:

See also

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.csvOptions.encoding
src_fmt_configs (dict) – configure optional fields specific to the source format
labels (dict) – a dictionary containing labels for the table, passed to BigQuery

encryption_configuration (dict) –

[Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

encryption_configuration = {
    "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
}

patch_table(self, dataset_id, table_id, project_id=None, description=None, expiration_time=None, external_data_configuration=None, friendly_name=None, labels=None, schema=None, time_partitioning=None, view=None, require_partition_filter=None, encryption_configuration=None)[source]¶

Patch information in an existing table. It only updates fileds that are provided in the request object.

Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/patch

Parameters

dataset_id (str) – The dataset containing the table to be patched.
table_id (str) – The Name of the table to be patched.
project_id (str) – The project containing the table to be patched.
description (str) – [Optional] A user-friendly description of this table.
expiration_time (int) – [Optional] The time when this table expires, in milliseconds since the epoch.
external_data_configuration (dict) – [Optional] A dictionary containing properties of a table stored outside of BigQuery.
friendly_name (str) – [Optional] A descriptive name for this table.
labels (dict) – [Optional] A dictionary containing labels associated with this table.
schema (list) –
[Optional] If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema The supported schema modifications and unsupported schema modification are listed here: https://cloud.google.com/bigquery/docs/managing-table-schemas Example:
```
schema=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
```
time_partitioning (dict) – [Optional] A dictionary containing time-based partitioning definition for the table.
view (dict) –
[Optional] A dictionary containing definition for the view. If set, it will patch a view instead of a table: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#view Example:
```
view = {
    "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 500",
    "useLegacySql": False
}
```
require_partition_filter (bool) – [Optional] If true, queries over the this table require a partition filter. If false, queries over the table

encryption_configuration (dict) –

[Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

encryption_configuration = {
    "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
}

run_query(self, bql=None, sql=None, destination_dataset_table=None, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, udf_config=None, use_legacy_sql=None, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', query_params=None, labels=None, schema_update_options=None, priority='INTERACTIVE', time_partitioning=None, api_resource_configs=None, cluster_fields=None, location=None, encryption_configuration=None)[source]¶

Executes a BigQuery SQL query. Optionally persists results in a BigQuery table. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about these parameters.

Parameters

bql (str) – (Deprecated. Use sql parameter instead) The BigQuery SQL to execute.
sql (str) – The BigQuery SQL to execute.
destination_dataset_table (str) – The dotted <dataset>.<table> BigQuery table to save the query results.
write_disposition (str) – What to do if the table already exists in BigQuery.
allow_large_results (bool) – Whether to allow large results.
flatten_results (bool) – If true and query uses legacy SQL dialect, flattens all nested and repeated fields in the query results. allowLargeResults must be true if this is set to false. For standard SQL queries, this flag is ignored and results are never flattened.
udf_config (list) – The User Defined Function configuration for the query. See https://cloud.google.com/bigquery/user-defined-functions for details.
use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false). If None, defaults to self.use_legacy_sql.
api_resource_configs (dict) – a dictionary that contain params ‘configuration’ applied for Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs for example, {‘query’: {‘useQueryCache’: False}}. You could use it if you need to provide some params that are not supported by the BigQueryHook like args.
maximum_billing_tier (int) – Positive integer that serves as a multiplier of the basic price.
maximum_bytes_billed (float) – Limits the bytes billed for this job. Queries that will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified, this will be set to your project default.
create_disposition (str) – Specifies whether the job is allowed to create new tables.
query_params (list) – a list of dictionary containing query parameter types and values, passed to BigQuery
labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
schema_update_options (Union[list, tuple, set]) – Allows the schema of the destination table to be updated as a side effect of the query job.
priority (str) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. The default value is INTERACTIVE.
time_partitioning (dict) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
cluster_fields (list[str]) – Request that the result of this query be stored sorted by one or more columns. This is only available in combination with time_partitioning. The order of columns given determines the sort order.
location (str) – The geographic location of the job. Required except for US and EU. See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location

encryption_configuration (dict) –

[Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

encryption_configuration = {
    "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
}

run_extract(self, source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=', ', print_header=True, labels=None)[source]¶

Executes a BigQuery extract command to copy data from BigQuery to Google Cloud Storage. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about these parameters.

Parameters

source_project_dataset_table (str) – The dotted <dataset>.<table> BigQuery table to use as the source data.
destination_cloud_storage_uris (list) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). Follows convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
compression (str) – Type of compression to use.
export_format (str) – File format to export.
field_delimiter (str) – The delimiter to use when extracting to a CSV.
print_header (bool) – Whether to print a header for a CSV file extract.
labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery

run_copy(self, source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', labels=None, encryption_configuration=None)[source]¶

Executes a BigQuery copy command to copy data from one BigQuery table to another. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy

For more details about these parameters.

Parameters

source_project_dataset_tables (list|string) – One or more dotted (project:|project.)<dataset>.<table> BigQuery tables to use as the source data. Use a list if there are multiple source tables. If <project> is not included, project will be the project defined in the connection json.
destination_project_dataset_table (str) – The destination BigQuery table. Format is: (project:|project.)<dataset>.<table>
write_disposition (str) – The write disposition if the table already exists.
create_disposition (str) – The create disposition if the table doesn’t exist.
labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery

encryption_configuration (dict) –

[Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

encryption_configuration = {
    "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
}

run_load(self, destination_project_dataset_table, source_uris, schema_fields=None, source_format='CSV', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=', ', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, encoding='UTF-8', schema_update_options=None, src_fmt_configs=None, time_partitioning=None, cluster_fields=None, autodetect=False, encryption_configuration=None)[source]¶

Executes a BigQuery load command to load data from Google Cloud Storage to BigQuery. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about these parameters.

Parameters

destination_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table>($<partition>) BigQuery table to load data into. If <project> is not included, project will be the project defined in the connection json. If a partition is specified the operator will automatically append the data, create a new partition or create a new DAY partitioned table.
schema_fields (list) – The schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Required if autodetect=False; optional if autodetect=True.
autodetect (bool) – Attempt to autodetect the schema for CSV and JSON source files.
source_uris (list) – The source Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). A single wild per-object name can be used.
source_format (str) – File format to export.
create_disposition (str) – The create disposition if the table doesn’t exist.
skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
write_disposition (str) – The write disposition if the table already exists.
field_delimiter (str) – The delimiter to use when loading from a CSV.
max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.
quote_character (str) – The value that is used to quote data sections in a CSV file.
ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.
allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).
allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable when soure_format is CSV.
encoding (str) –
The character encoding of the data.

See also

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.csvOptions.encoding
schema_update_options (Union[list, tuple, set]) – Allows the schema of the destination table to be updated as a side effect of the load job.
src_fmt_configs (dict) – configure optional fields specific to the source format
time_partitioning (dict) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
cluster_fields (list[str]) – Request that the result of this load be stored sorted by one or more columns. This is only available in combination with time_partitioning. The order of columns given determines the sort order.

encryption_configuration (dict) –

[Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

encryption_configuration = {
    "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
}

run_with_configuration(self, configuration)[source]¶

Executes a BigQuery SQL query. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about the configuration parameter.

Parameters: configuration – The configuration parameter maps directly to BigQuery’s configuration field in the job object. See https://cloud.google.com/bigquery/docs/reference/v2/jobs for details.

poll_job_complete(self, job_id)[source]¶

cancel_query(self)[source]¶: Cancel all started queries that have not yet completed

get_schema(self, dataset_id, table_id)[source]¶

Get the schema for a given datset.table. see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource

Parameters

dataset_id – the dataset ID of the requested table
table_id – the table ID of the requested table

Returns

a table schema

get_tabledata(self, dataset_id, table_id, max_results=None, selected_fields=None, page_token=None, start_index=None)[source]¶

Get the data of a given dataset.table and optionally with selected columns. see https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list

Parameters

dataset_id – the dataset ID of the requested table.
table_id – the table ID of the requested table.
max_results – the maximum results to return.
selected_fields – List of fields to return (comma-separated). If unspecified, all fields are returned.
page_token – page token, returned from a previous call, identifying the result set.
start_index – zero based index of the starting row to read.

Returns

map containing the requested rows.

run_table_delete(self, deletion_dataset_table, ignore_if_missing=False)[source]¶

Delete an existing table from the dataset; If the table does not exist, return an error unless ignore_if_missing is set to True.

Parameters

deletion_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.
ignore_if_missing (bool) – if True, then return success even if the requested table does not exist.

Returns

run_table_upsert(self, dataset_id, table_resource, project_id=None)[source]¶

creates a new, empty table in the dataset; If the table already exists, update the existing table. Since BigQuery does not natively allow table upserts, this is not an atomic operation.

Parameters

dataset_id (str) – the dataset to upsert the table into.
table_resource (dict) – a table resource. see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
project_id – the project to upsert the table into. If None, project will be self.project_id.

Returns

run_grant_dataset_view_access(self, source_dataset, view_dataset, view_table, source_project=None, view_project=None)[source]¶

Grant authorized view access of a dataset to a view table. If this view has already been granted access to the dataset, do nothing. This method is not atomic. Running it may clobber a simultaneous update.

Parameters

source_dataset (str) – the source dataset
view_dataset (str) – the dataset that the view is in
view_table (str) – the table of the view
source_project (str) – the project of the source dataset. If None, self.project_id will be used.
view_project (str) – the project that the view is in. If None, self.project_id will be used.

Returns

the datasets resource of the source dataset.

create_empty_dataset(self, dataset_id='', project_id='', dataset_reference=None)[source]¶

Create a new empty dataset: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/insert

Parameters

project_id (str) – The name of the project where we want to create an empty a dataset. Don’t need to provide, if projectId in dataset_reference.
dataset_id (str) – The id of dataset. Don’t need to provide, if datasetId in dataset_reference.
dataset_reference (dict) – Dataset reference that could be provided with request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource

delete_dataset(self, project_id, dataset_id, delete_contents=False)[source]¶

Delete a dataset of Big query in your project.

Parameters

project_id (str) – The name of the project where we have the dataset .
dataset_id (str) – The dataset to be delete.
delete_contents (bool) – [Optional] Whether to force the deletion even if the dataset is not empty. Will delete all tables (if any) in the dataset if set to True. Will raise HttpError 400: “{dataset_id} is still in use” if set to False and dataset is not empty. The default value is False.

Returns

get_dataset(self, dataset_id, project_id=None)[source]¶

Method returns dataset_resource if dataset exist and raised 404 error if dataset does not exist

Parameters

dataset_id (str) – The BigQuery Dataset ID
project_id (str) – The GCP Project ID

Returns

dataset_resource

airflow.contrib.hooks.bigquery_hook¶

Module Contents¶

`airflow.contrib.hooks.bigquery_hook`¶