airflow.providers.google.cloud.hooks.bigquery
¶
BigQuery Hook and a very basic PEP 249 implementation for BigQuery.
Module Contents¶
Classes¶
Interact with BigQuery. |
|
BigQuery connection. |
|
BigQuery cursor. |
|
A very basic BigQuery PEP 249 cursor implementation. |
|
Uses gcloud-aio library to retrieve Job details. |
|
Async hook for BigQuery Table. |
Functions¶
|
Attributes¶
- class airflow.providers.google.cloud.hooks.bigquery.BigQueryHook(gcp_conn_id=GoogleBaseHook.default_conn_name, use_legacy_sql=True, location=None, priority='INTERACTIVE', api_resource_configs=None, impersonation_chain=None, labels=None, **kwargs)[source]¶
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseHook
,airflow.providers.common.sql.hooks.sql.DbApiHook
Interact with BigQuery.
This hook uses the Google Cloud connection.
- Parameters
gcp_conn_id (str) – The Airflow connection used for GCP credentials.
use_legacy_sql (bool) – This specifies whether to use legacy SQL dialect.
location (str | None) – The location of the BigQuery resource.
priority (str) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. The default value is INTERACTIVE.
api_resource_configs (dict | None) – This contains params configuration applied for Google BigQuery jobs.
impersonation_chain (str | Sequence[str] | None) – This is the optional service account to impersonate using short term credentials.
labels (dict | None) – The BigQuery resource label.
- get_sqlalchemy_engine(engine_kwargs=None)[source]¶
Create an SQLAlchemy engine object.
- Parameters
engine_kwargs – Kwargs used in
create_engine()
.
- get_records(sql, parameters=None)[source]¶
Execute the sql and return a set of records.
- Parameters
sql – the sql statement to be executed (str) or a list of sql statements to execute
parameters – The parameters to render the SQL query with.
- abstract insert_rows(table, rows, target_fields=None, commit_every=1000, replace=False, **kwargs)[source]¶
Insert rows.
Insertion is currently unsupported. Theoretically, you could use BigQuery’s streaming API to insert rows into a table, but this hasn’t been implemented.
- get_pandas_df(sql, parameters=None, dialect=None, **kwargs)[source]¶
Get a Pandas DataFrame for the BigQuery results.
The DbApiHook method must be overridden because Pandas doesn’t support PEP 249 connections, except for SQLite.
See also
https://github.com/pandas-dev/pandas/blob/055d008615272a1ceca9720dc365a2abd316f353/pandas/io/sql.py#L415 https://github.com/pandas-dev/pandas/issues/6900
- Parameters
sql (str) – The BigQuery SQL to execute.
parameters (Iterable | Mapping[str, Any] | None) – The parameters to render the SQL query with (not used, leave to override superclass method)
dialect (str | None) – Dialect of BigQuery SQL – legacy SQL or standard SQL defaults to use self.use_legacy_sql if not specified
kwargs – (optional) passed into pandas_gbq.read_gbq method
- table_exists(dataset_id, table_id, project_id)[source]¶
Check if a table exists in Google BigQuery.
- Parameters
- table_partition_exists(dataset_id, table_id, partition_id, project_id)[source]¶
Check if a partition exists in Google BigQuery.
- Parameters
project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
dataset_id (str) – The name of the dataset in which to look for the table.
table_id (str) – The name of the table to check the existence of.
partition_id (str) – The name of the partition to check the existence of.
- create_empty_table(project_id=None, dataset_id=None, table_id=None, table_resource=None, schema_fields=None, time_partitioning=None, cluster_fields=None, labels=None, view=None, materialized_view=None, encryption_configuration=None, retry=DEFAULT_RETRY, location=None, exists_ok=True)[source]¶
Create a new, empty table in the dataset.
To create a view, which is defined by a SQL query, parse a dictionary to the view argument.
- Parameters
project_id (str | None) – The project to create the table into.
dataset_id (str | None) – The dataset to create the table into.
table_id (str | None) – The Name of the table to be created.
table_resource (dict[str, Any] | None) – Table resource as described in documentation: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table If provided all other parameters are ignored.
schema_fields (list | None) –
If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
schema_fields = [ {"name": "emp_name", "type": "STRING", "mode": "REQUIRED"}, {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}, ]
labels (dict | None) – a dictionary containing labels for the table, passed to BigQuery
retry (google.api_core.retry.Retry) – Optional. How to retry the RPC.
time_partitioning (dict | None) –
configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
cluster_fields (list[str] | None) – [Optional] The fields used for clustering. BigQuery supports clustering for both partitioned and non-partitioned tables. https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#clustering.fields
view (dict | None) –
[Optional] A dictionary containing definition for the view. If set, it will create a view instead of a table: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#ViewDefinition
view = { "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 1000", "useLegacySql": False, }
materialized_view (dict | None) – [Optional] The materialized view definition.
encryption_configuration (dict | None) –
[Optional] Custom encryption configuration (e.g., Cloud KMS keys).
encryption_configuration = { "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key", }
num_retries – Maximum number of retries in case of connection problems.
location (str | None) – (Optional) The geographic location where the table should reside.
exists_ok (bool) – If
True
, ignore “already exists” errors when creating the table.
- Returns
Created table
- Return type
- create_empty_dataset(dataset_id=None, project_id=None, location=None, dataset_reference=None, exists_ok=True)[source]¶
Create a new empty dataset.
- Parameters
project_id (str | None) – The name of the project where we want to create an empty a dataset. Don’t need to provide, if projectId in dataset_reference.
dataset_id (str | None) – The id of dataset. Don’t need to provide, if datasetId in dataset_reference.
location (str | None) – (Optional) The geographic location where the dataset should reside. There is no default value but the dataset will be created in US if nothing is provided.
dataset_reference (dict[str, Any] | None) – Dataset reference that could be provided with request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
exists_ok (bool) – If
True
, ignore “already exists” errors when creating the dataset.
- get_dataset_tables(dataset_id, project_id=None, max_results=None, retry=DEFAULT_RETRY)[source]¶
Get the list of tables for a given dataset.
For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list
- Parameters
dataset_id (str) – the dataset ID of the requested dataset.
project_id (str | None) – (Optional) the project of the requested dataset. If None, self.project_id will be used.
max_results (int | None) – (Optional) the maximum number of tables to return.
retry (google.api_core.retry.Retry) – How to retry the RPC.
- Returns
List of tables associated with the dataset.
- Return type
- delete_dataset(dataset_id, project_id=None, delete_contents=False, retry=DEFAULT_RETRY)[source]¶
Delete a dataset of Big query in your project.
- Parameters
project_id (str | None) – The name of the project where we have the dataset.
dataset_id (str) – The dataset to be delete.
delete_contents (bool) – If True, delete all the tables in the dataset. If False and the dataset contains tables, the request will fail.
retry (google.api_core.retry.Retry) – How to retry the RPC.
- create_external_table(external_project_dataset_table, schema_fields, source_uris, source_format='CSV', autodetect=False, compression='NONE', ignore_unknown_values=False, max_bad_records=0, skip_leading_rows=0, field_delimiter=',', quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, encoding='UTF-8', src_fmt_configs=None, labels=None, description=None, encryption_configuration=None, location=None, project_id=None)[source]¶
Create an external table in the dataset with data from Google Cloud Storage.
This method is deprecated. Please use
create_empty_table()
with thetable_resource
object. See function documentation for more details about these parameters.- Parameters
external_project_dataset_table (str) – The dotted
(<project>.|<project>:)<dataset>.<table>($<partition>)
BigQuery table name to create external table. If<project>
is not included, project will be the project defined in the connection json.schema_fields (list) – The schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource
source_uris (list) – The source Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). A single wild per-object name can be used.
source_format (str) – File format to export.
autodetect (bool) – Try to detect schema and format options automatically. Any option specified explicitly will be honored.
compression (str) – [Optional] The compression type of the data source. Possible values include GZIP and NONE. The default value is NONE. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.
max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.
skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
field_delimiter (str) – The delimiter to use when loading from a CSV.
quote_character (str | None) – The value that is used to quote data sections in a CSV file.
allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).
allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable when source_format is CSV.
encoding (str) –
The character encoding of the data. See:
src_fmt_configs (dict | None) – configure optional fields specific to the source format
labels (dict | None) – A dictionary containing labels for the BiqQuery table.
description (str | None) – A string containing the description for the BigQuery table.
encryption_configuration (dict | None) –
[Optional] Custom encryption configuration (e.g., Cloud KMS keys).
encryption_configuration = { "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key", }
- update_table(table_resource, fields=None, dataset_id=None, table_id=None, project_id=None)[source]¶
Change some fields of a table.
Use
fields
to specify which fields to update. At least one field must be provided. If a field is listed infields
and isNone
intable
, the field value will be deleted.If
table.etag
is notNone
, the update will only succeed if the table on the server has the same ETag. Thus reading a table withget_table
, changing its fields, and then passing it toupdate_table
will ensure that the changes will only be saved if no modifications to the table occurred since the read.- Parameters
project_id (str | None) – The project to create the table into.
dataset_id (str | None) – The dataset to create the table into.
table_id (str | None) – The Name of the table to be created.
table_resource (dict[str, Any]) – Table resource as described in documentation: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table The table has to contain
tableReference
orproject_id
,dataset_id
andtable_id
have to be provided.fields (list[str] | None) – The fields of
table
to change, spelled as the Table properties (e.g. “friendly_name”).
- patch_table(dataset_id, table_id, project_id=None, description=None, expiration_time=None, external_data_configuration=None, friendly_name=None, labels=None, schema=None, time_partitioning=None, view=None, require_partition_filter=None, encryption_configuration=None)[source]¶
Patch information in an existing table.
It only updates fields that are provided in the request object. This method is deprecated. Please use
update_table()
instead.Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/patch
- Parameters
dataset_id (str) – The dataset containing the table to be patched.
table_id (str) – The Name of the table to be patched.
project_id (str | None) – The project containing the table to be patched.
description (str | None) – [Optional] A user-friendly description of this table.
expiration_time (int | None) – [Optional] The time when this table expires, in milliseconds since the epoch.
external_data_configuration (dict | None) – [Optional] A dictionary containing properties of a table stored outside of BigQuery.
friendly_name (str | None) – [Optional] A descriptive name for this table.
labels (dict | None) – [Optional] A dictionary containing labels associated with this table.
schema (list | None) –
[Optional] If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema The supported schema modifications and unsupported schema modification are listed here: https://cloud.google.com/bigquery/docs/managing-table-schemas
schema = [ {"name": "emp_name", "type": "STRING", "mode": "REQUIRED"}, {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}, ]
time_partitioning (dict | None) – [Optional] A dictionary containing time-based partitioning definition for the table.
view (dict | None) –
[Optional] A dictionary containing definition for the view. If set, it will patch a view instead of a table: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#ViewDefinition
view = { "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 500", "useLegacySql": False, }
require_partition_filter (bool | None) – [Optional] If true, queries over the this table require a partition filter. If false, queries over the table
encryption_configuration (dict | None) –
[Optional] Custom encryption configuration (e.g., Cloud KMS keys).
encryption_configuration = { "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key", }
- insert_all(project_id, dataset_id, table_id, rows, ignore_unknown_values=False, skip_invalid_rows=False, fail_on_error=False)[source]¶
Stream data into BigQuery one record at a time without a load job.
See also
For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
- Parameters
project_id (str) – The name of the project where we have the table
dataset_id (str) – The name of the dataset where we have the table
table_id (str) – The name of the table
rows (list) –
the rows to insert
rows = [{"json": {"a_key": "a_value_0"}}, {"json": {"a_key": "a_value_1"}}]
ignore_unknown_values (bool) – [Optional] Accept rows that contain values that do not match the schema. The unknown values are ignored. The default value is false, which treats unknown values as errors.
skip_invalid_rows (bool) – [Optional] Insert all valid rows of a request, even if invalid rows exist. The default value is false, which causes the entire request to fail if any invalid rows exist.
fail_on_error (bool) – [Optional] Force the task to fail if any errors occur. The default value is false, which indicates the task should not fail even if any insertion errors occur.
- update_dataset(fields, dataset_resource, dataset_id=None, project_id=None, retry=DEFAULT_RETRY)[source]¶
Change some fields of a dataset.
Use
fields
to specify which fields to update. At least one field must be provided. If a field is listed infields
and isNone
indataset
, it will be deleted.If
dataset.etag
is notNone
, the update will only succeed if the dataset on the server has the same ETag. Thus reading a dataset withget_dataset
, changing its fields, and then passing it toupdate_dataset
will ensure that the changes will only be saved if no modifications to the dataset occurred since the read.- Parameters
dataset_resource (dict[str, Any]) – Dataset resource that will be provided in request body. https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
dataset_id (str | None) – The id of the dataset.
fields (Sequence[str]) – The properties of
dataset
to change (e.g. “friendly_name”).project_id (str | None) – The Google Cloud Project ID
retry (google.api_core.retry.Retry) – How to retry the RPC.
- patch_dataset(dataset_id, dataset_resource, project_id=None)[source]¶
Patches information in an existing dataset.
It only replaces fields that are provided in the submitted dataset resource.
This method is deprecated. Please use
update_dataset()
instead.More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/patch
- Parameters
dataset_id (str) – The BigQuery Dataset ID
dataset_resource (dict) – Dataset resource that will be provided in request body. https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
project_id (str | None) – The Google Cloud Project ID
- get_dataset_tables_list(dataset_id, project_id=None, table_prefix=None, max_results=None)[source]¶
List tables of a BigQuery dataset.
If a table prefix is specified, only tables beginning by it are returned. This method is deprecated. Please use
get_dataset_tables()
instead.For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list
- Parameters
dataset_id (str) – The BigQuery Dataset ID
project_id (str | None) – The Google Cloud Project ID
table_prefix (str | None) – Tables must begin by this prefix to be returned (case sensitive)
max_results (int | None) – The maximum number of results to return in a single response page. Leverage the page tokens to iterate through the entire collection.
- Returns
List of tables associated with the dataset
- Return type
- get_datasets_list(project_id=None, include_all=False, filter_=None, max_results=None, page_token=None, retry=DEFAULT_RETRY, return_iterator=False)[source]¶
Get all BigQuery datasets in the current project.
For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list
- Parameters
project_id (str | None) – Google Cloud Project for which you try to get all datasets
include_all (bool) – True if results include hidden datasets. Defaults to False.
filter – An expression for filtering the results by label. For syntax, see https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list#filter.
filter – str
max_results (int | None) – Maximum number of datasets to return.
max_results – int
page_token (str | None) – Token representing a cursor into the datasets. If not passed, the API will return the first page of datasets. The token marks the beginning of the iterator to be returned and the value of the
page_token
can be accessed atnext_page_token
of theHTTPIterator
.page_token – str
retry (google.api_core.retry.Retry) – How to retry the RPC.
return_iterator (bool) – Instead of returning a list[Row], returns a HTTPIterator which can be used to obtain the next_page_token property.
- get_dataset(dataset_id, project_id=None)[source]¶
Fetch the dataset referenced by dataset_id.
- Parameters
- Returns
dataset_resource
- Return type
See also
For more information, see Dataset Resource content: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
- run_grant_dataset_view_access(source_dataset, view_dataset, view_table, view_project=None, project_id=None)[source]¶
Grant authorized view access of a dataset to a view table.
If this view has already been granted access to the dataset, do nothing. This method is not atomic. Running it may clobber a simultaneous update.
- Parameters
source_dataset (str) – the source dataset
view_dataset (str) – the dataset that the view is in
view_table (str) – the table of the view
project_id (str | None) – the project of the source dataset. If None, self.project_id will be used.
view_project (str | None) – the project that the view is in. If None, self.project_id will be used.
- Returns
the datasets resource of the source dataset.
- Return type
- run_table_upsert(dataset_id, table_resource, project_id=None)[source]¶
Update a table if it exists, otherwise create a new one.
Since BigQuery does not natively allow table upserts, this is not an atomic operation.
- Parameters
dataset_id (str) – the dataset to upsert the table into.
table_resource (dict[str, Any]) – a table resource. see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
project_id (str | None) – the project to upsert the table into. If None, project will be self.project_id.
- run_table_delete(deletion_dataset_table, ignore_if_missing=False)[source]¶
Delete an existing table from the dataset.
If the table does not exist, return an error unless ignore_if_missing is set to True.
This method is deprecated. Please use
delete_table()
instead.
- delete_table(table_id, not_found_ok=True, project_id=None)[source]¶
Delete an existing table from the dataset.
If the table does not exist, return an error unless not_found_ok is set to True.
- get_tabledata(dataset_id, table_id, max_results=None, selected_fields=None, page_token=None, start_index=None)[source]¶
Get data from given table.
This method is deprecated. Please use
list_rows()
instead.- Parameters
dataset_id (str) – the dataset ID of the requested table.
table_id (str) – the table ID of the requested table.
max_results (int | None) – the maximum results to return.
selected_fields (str | None) – List of fields to return (comma-separated). If unspecified, all fields are returned.
page_token (str | None) – page token, returned from a previous call, identifying the result set.
start_index (int | None) – zero based index of the starting row to read.
- Returns
list of rows
- Return type
- list_rows(dataset_id, table_id, max_results=None, selected_fields=None, page_token=None, start_index=None, project_id=None, location=None, retry=DEFAULT_RETRY, return_iterator=False)[source]¶
List rows in a table.
See https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list
- Parameters
dataset_id (str) – the dataset ID of the requested table.
table_id (str) – the table ID of the requested table.
max_results (int | None) – the maximum results to return.
selected_fields (list[str] | str | None) – List of fields to return (comma-separated). If unspecified, all fields are returned.
page_token (str | None) – page token, returned from a previous call, identifying the result set.
start_index (int | None) – zero based index of the starting row to read.
project_id (str | None) – Project ID for the project which the client acts on behalf of.
location (str | None) – Default location for job.
retry (google.api_core.retry.Retry) – How to retry the RPC.
return_iterator (bool) – Instead of returning a list[Row], returns a RowIterator which can be used to obtain the next_page_token property.
- Returns
list of rows
- Return type
list[google.cloud.bigquery.table.Row] | google.cloud.bigquery.table.RowIterator
- get_schema(dataset_id, table_id, project_id=None)[source]¶
Get the schema for a given dataset and table.
- Parameters
- Returns
a table schema
- Return type
- update_table_schema(schema_fields_updates, include_policy_tags, dataset_id, table_id, project_id=None)[source]¶
Update fields within a schema for a given dataset and table.
Note that some fields in schemas are immutable; trying to change them will cause an exception.
If a new field is included, it will be inserted, which requires all required fields to be set.
- Parameters
include_policy_tags (bool) – If set to True policy tags will be included in the update request which requires special permissions even if unchanged see https://cloud.google.com/bigquery/docs/column-level-security#roles
dataset_id (str) – the dataset ID of the requested table to be updated
table_id (str) – the table ID of the table to be updated
schema_fields_updates (list[dict[str, Any]]) –
a partial schema resource. See https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema
schema_fields_updates = [ {"name": "emp_name", "description": "Some New Description"}, {"name": "salary", "description": "Some New Description"}, { "name": "departments", "fields": [ {"name": "name", "description": "Some New Description"}, {"name": "type", "description": "Some New Description"}, ], }, ]
project_id (str | None) – The name of the project where we want to update the table.
- poll_job_complete(job_id, project_id=None, location=None, retry=DEFAULT_RETRY)[source]¶
Check if jobs have completed.
- Parameters
job_id (str) – id of the job.
project_id (str | None) – Google Cloud Project where the job is running
location (str | None) – location the job is running
retry (google.api_core.retry.Retry) – How to retry the RPC.
- cancel_job(job_id, project_id=None, location=None)[source]¶
Cancel a job and wait for cancellation to complete.
- insert_job(configuration, job_id=None, project_id=None, location=None, nowait=False, retry=DEFAULT_RETRY, timeout=None)[source]¶
Execute a BigQuery job and wait for it to complete.
- Parameters
configuration (dict) – The configuration parameter maps directly to BigQuery’s configuration field in the job object. See https://cloud.google.com/bigquery/docs/reference/v2/jobs for details.
job_id (str | None) – The ID of the job. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters. If not provided then uuid will be generated.
project_id (str | None) – Google Cloud Project where the job is running.
location (str | None) – Location the job is running.
nowait (bool) – Whether to insert job without waiting for the result.
retry (google.api_core.retry.Retry) – How to retry the RPC.
timeout (float | None) – The number of seconds to wait for the underlying HTTP transport before using
retry
.
- Returns
The job ID.
- Return type
BigQueryJob
- run_with_configuration(configuration)[source]¶
Execute a BigQuery SQL query.
This method is deprecated. Please use
insert_job()
instead.- Parameters
configuration (dict) – The configuration parameter maps directly to BigQuery’s configuration field in the job object. See https://cloud.google.com/bigquery/docs/reference/v2/jobs for details.
- run_load(destination_project_dataset_table, source_uris, schema_fields=None, source_format='CSV', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=',', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, encoding='UTF-8', schema_update_options=None, src_fmt_configs=None, time_partitioning=None, cluster_fields=None, autodetect=False, encryption_configuration=None, labels=None, description=None)[source]¶
Load data from Google Cloud Storage to BigQuery.
This method is deprecated. Please use
insert_job()
instead.- Parameters
destination_project_dataset_table (str) – The dotted
(<project>.|<project>:)<dataset>.<table>($<partition>)
BigQuery table to load data into. If<project>
is not included, project will be the project defined in the connection json. If a partition is specified the operator will automatically append the data, create a new partition or create a new DAY partitioned table.schema_fields (list | None) – The schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Required if autodetect=False; optional if autodetect=True.
autodetect (bool) – Attempt to autodetect the schema for CSV and JSON source files.
source_uris (list) – The source Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). A single wild per-object name can be used.
source_format (str) – File format to export.
create_disposition (str) – The create disposition if the table doesn’t exist.
skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
write_disposition (str) – The write disposition if the table already exists.
field_delimiter (str) – The delimiter to use when loading from a CSV.
max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.
quote_character (str | None) – The value that is used to quote data sections in a CSV file.
ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.
allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).
allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable when source_format is CSV.
encoding (str) –
The character encoding of the data.
schema_update_options (Iterable | None) – Allows the schema of the destination table to be updated as a side effect of the load job.
src_fmt_configs (dict | None) – configure optional fields specific to the source format
time_partitioning (dict | None) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
cluster_fields (list | None) – Request that the result of this load be stored sorted by one or more columns. BigQuery supports clustering for both partitioned and non-partitioned tables. The order of columns given determines the sort order.
encryption_configuration (dict | None) –
[Optional] Custom encryption configuration (e.g., Cloud KMS keys).
encryption_configuration = { "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key", }
labels (dict | None) – A dictionary containing labels for the BiqQuery table.
description (str | None) – A string containing the description for the BigQuery table.
- run_copy(source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', labels=None, encryption_configuration=None)[source]¶
Copy data from one BigQuery table to another.
This method is deprecated. Please use
insert_job()
instead.- Parameters
source_project_dataset_tables (list | str) – One or more dotted
(project:|project.)<dataset>.<table>
BigQuery tables to use as the source data. Use a list if there are multiple source tables. If<project>
is not included, project will be the project defined in the connection json.destination_project_dataset_table (str) – The destination BigQuery table. Format is:
(project:|project.)<dataset>.<table>
write_disposition (str) – The write disposition if the table already exists.
create_disposition (str) – The create disposition if the table doesn’t exist.
labels (dict | None) – a dictionary containing labels for the job/query, passed to BigQuery
encryption_configuration (dict | None) –
[Optional] Custom encryption configuration (e.g., Cloud KMS keys).
encryption_configuration = { "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key", }
- run_extract(source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=',', print_header=True, labels=None, return_full_job=False)[source]¶
Copy data from BigQuery to Google Cloud Storage.
This method is deprecated. Please use
insert_job()
instead.- Parameters
source_project_dataset_table (str) – The dotted
<dataset>.<table>
BigQuery table to use as the source data.destination_cloud_storage_uris (list[str]) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). Follows convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
compression (str) – Type of compression to use.
export_format (str) – File format to export.
field_delimiter (str) – The delimiter to use when extracting to a CSV.
print_header (bool) – Whether to print a header for a CSV file extract.
labels (dict | None) – a dictionary containing labels for the job/query, passed to BigQuery
return_full_job (bool) – return full job instead of job id only
- run_query(sql, destination_dataset_table=None, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, udf_config=None, use_legacy_sql=None, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', query_params=None, labels=None, schema_update_options=None, priority=None, time_partitioning=None, api_resource_configs=None, cluster_fields=None, location=None, encryption_configuration=None)[source]¶
Execute a BigQuery SQL query.
Optionally persists results in a BigQuery table.
This method is deprecated. Please use
insert_job()
instead.For more details about these parameters.
- Parameters
sql (str) – The BigQuery SQL to execute.
destination_dataset_table (str | None) – The dotted
<dataset>.<table>
BigQuery table to save the query results.write_disposition (str) – What to do if the table already exists in BigQuery.
allow_large_results (bool) – Whether to allow large results.
flatten_results (bool | None) – If true and query uses legacy SQL dialect, flattens all nested and repeated fields in the query results.
allowLargeResults
must be true if this is set to false. For standard SQL queries, this flag is ignored and results are never flattened.udf_config (list | None) – The User Defined Function configuration for the query. See https://cloud.google.com/bigquery/user-defined-functions for details.
use_legacy_sql (bool | None) – Whether to use legacy SQL (true) or standard SQL (false). If None, defaults to self.use_legacy_sql.
api_resource_configs (dict | None) – a dictionary that contain params ‘configuration’ applied for Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs for example, {‘query’: {‘useQueryCache’: False}}. You could use it if you need to provide some params that are not supported by the BigQueryHook like args.
maximum_billing_tier (int | None) – Positive integer that serves as a multiplier of the basic price.
maximum_bytes_billed (float | None) – Limits the bytes billed for this job. Queries that will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified, this will be set to your project default.
create_disposition (str) – Specifies whether the job is allowed to create new tables.
query_params (list | None) – a list of dictionary containing query parameter types and values, passed to BigQuery
labels (dict | None) – a dictionary containing labels for the job/query, passed to BigQuery
schema_update_options (Iterable | None) – Allows the schema of the destination table to be updated as a side effect of the query job.
priority (str | None) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. If None, defaults to self.priority.
time_partitioning (dict | None) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
cluster_fields (list[str] | None) – Request that the result of this query be stored sorted by one or more columns. BigQuery supports clustering for both partitioned and non-partitioned tables. The order of columns given determines the sort order.
location (str | None) – The geographic location of the job. Required except for US and EU. See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location
encryption_configuration (dict | None) –
[Optional] Custom encryption configuration (e.g., Cloud KMS keys).
encryption_configuration = { "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key", }
- class airflow.providers.google.cloud.hooks.bigquery.BigQueryConnection(*args, **kwargs)[source]¶
BigQuery connection.
BigQuery does not have a notion of a persistent connection. Thus, these objects are small stateless factories for cursors, which do all the real work.
- class airflow.providers.google.cloud.hooks.bigquery.BigQueryBaseCursor(service, project_id, hook, use_legacy_sql=True, api_resource_configs=None, location=None, num_retries=5, labels=None)[source]¶
Bases:
airflow.utils.log.logging_mixin.LoggingMixin
BigQuery cursor.
The BigQuery base cursor contains helper methods to execute queries against BigQuery. The methods can be used directly by operators, in cases where a PEP 249 cursor isn’t needed.
- create_empty_table(*args, **kwargs)[source]¶
This method is deprecated.
Please use
create_empty_table()
instead.
- create_empty_dataset(*args, **kwargs)[source]¶
This method is deprecated.
Please use
create_empty_dataset()
instead.
- get_dataset_tables(*args, **kwargs)[source]¶
This method is deprecated.
Please use
get_dataset_tables()
instead.
- delete_dataset(*args, **kwargs)[source]¶
This method is deprecated.
Please use
delete_dataset()
instead.
- create_external_table(*args, **kwargs)[source]¶
This method is deprecated.
Please use
create_external_table()
instead.
- patch_table(*args, **kwargs)[source]¶
This method is deprecated.
Please use
patch_table()
instead.
- insert_all(*args, **kwargs)[source]¶
This method is deprecated.
Please use
insert_all()
instead.
- update_dataset(*args, **kwargs)[source]¶
This method is deprecated.
Please use
update_dataset()
instead.
- patch_dataset(*args, **kwargs)[source]¶
This method is deprecated.
Please use
patch_dataset()
instead.
- get_dataset_tables_list(*args, **kwargs)[source]¶
This method is deprecated.
Please use
get_dataset_tables_list()
instead.
- get_datasets_list(*args, **kwargs)[source]¶
This method is deprecated.
Please use
get_datasets_list()
instead.
- get_dataset(*args, **kwargs)[source]¶
This method is deprecated.
Please use
get_dataset()
instead.
- run_grant_dataset_view_access(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_grant_dataset_view_access()
instead.
- run_table_upsert(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_table_upsert()
instead.
- run_table_delete(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_table_delete()
instead.
- get_tabledata(*args, **kwargs)[source]¶
This method is deprecated.
Please use
get_tabledata()
instead.
- get_schema(*args, **kwargs)[source]¶
This method is deprecated.
Please use
get_schema()
instead.
- poll_job_complete(*args, **kwargs)[source]¶
This method is deprecated.
Please use
poll_job_complete()
instead.
- cancel_query(*args, **kwargs)[source]¶
This method is deprecated.
Please use
cancel_query()
instead.
- run_with_configuration(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_with_configuration()
instead.
- run_load(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_load()
instead.
- run_copy(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_copy()
instead.
- run_extract(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_extract()
instead.
- run_query(*args, **kwargs)[source]¶
This method is deprecated.
Please use
run_query()
instead.
- class airflow.providers.google.cloud.hooks.bigquery.BigQueryCursor(service, project_id, hook, use_legacy_sql=True, location=None, num_retries=5)[source]¶
Bases:
BigQueryBaseCursor
A very basic BigQuery PEP 249 cursor implementation.
The PyHive PEP 249 implementation was used as a reference:
https://github.com/dropbox/PyHive/blob/master/pyhive/presto.py https://github.com/dropbox/PyHive/blob/master/pyhive/common.py
- executemany(operation, seq_of_parameters)[source]¶
Execute a BigQuery query multiple times with different parameters.
- next()[source]¶
Return the next row from a buffer.
Helper method for
fetchone
.If the buffer is empty, attempts to paginate through the result set for the next page, and load it into the buffer.
- fetchmany(size=None)[source]¶
Fetch the next set of rows of a query result.
This returns a sequence of sequences (e.g. a list of tuples). An empty sequence is returned when no more rows are available.
The number of rows to fetch per call is specified by the parameter. If it is not given, the cursor’s arraysize determines the number of rows to be fetched.
This method tries to fetch as many rows as indicated by the size parameter. If this is not possible due to the specified number of rows not being available, fewer rows may be returned.
An
Error
(or subclass) exception is raised if the previous call toexecute()
did not produce any result set, or no call was issued yet.
- airflow.providers.google.cloud.hooks.bigquery.split_tablename(table_input, default_project_id, var_name=None)[source]¶
- class airflow.providers.google.cloud.hooks.bigquery.BigQueryAsyncHook(**kwargs)[source]¶
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook
Uses gcloud-aio library to retrieve Job details.
- async get_job_instance(project_id, job_id, session)[source]¶
Get the specified job resource by job ID and project ID.
- async get_job_output(job_id, project_id=None)[source]¶
Get the BigQuery job output for a given job ID asynchronously.
- async create_job_for_partition_get(dataset_id, project_id=None)[source]¶
Create a new job and get the job_id using gcloud-aio.
- value_check(sql, pass_value, records, tolerance=None)[source]¶
Match a single query resulting row and tolerance with pass_value.
- Raises
AirflowException – if matching fails
- interval_check(row1, row2, metrics_thresholds, ignore_zero, ratio_formula)[source]¶
Check values of metrics (SQL expressions) are within a certain tolerance.
- Parameters
row1 (str | None) – first resulting row of a query execution job for first SQL query
row2 (str | None) – first resulting row of a query execution job for second SQL query
metrics_thresholds (dict[str, Any]) – a dictionary of ratios indexed by metrics, for example ‘COUNT(*)’: 1.5 would require a 50 percent or less difference between the current day, and the prior days_back.
ignore_zero (bool) – whether we should ignore zero metrics
ratio_formula (str) – which formula to use to compute the ratio between the two metrics. Assuming cur is the metric of today and ref is the metric to today - days_back. max_over_min: computes max(cur, ref) / min(cur, ref) relative_diff: computes abs(cur-ref) / ref
- class airflow.providers.google.cloud.hooks.bigquery.BigQueryTableAsyncHook(**kwargs)[source]¶
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook
Async hook for BigQuery Table.
- async get_table_client(dataset, table_id, project_id, session)[source]¶
Get a Google Big Query Table object.
- Parameters
dataset (str) – The name of the dataset in which to look for the table storage bucket.
table_id (str) – The name of the table to check the existence of.
project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
session (aiohttp.ClientSession) – aiohttp ClientSession