airflow.providers.google.cloud.hooks.bigquery¶

BigQuery Hook and a very basic PEP 249 implementation for BigQuery.

Attributes¶

`log`
`BigQueryJob`

Classes¶

`BigQueryHook`	Interact with BigQuery.
`BigQueryConnection`	BigQuery connection.
`BigQueryBaseCursor`	BigQuery cursor.
`BigQueryCursor`	A very basic BigQuery PEP 249 cursor implementation.
`BigQueryAsyncHook`	Uses gcloud-aio library to retrieve Job details.
`BigQueryTableAsyncHook`	Async hook for BigQuery Table.

Module Contents¶

airflow.providers.google.cloud.hooks.bigquery.log[source]¶

airflow.providers.google.cloud.hooks.bigquery.BigQueryJob[source]¶

class airflow.providers.google.cloud.hooks.bigquery.BigQueryHook(use_legacy_sql=True, location=None, priority='INTERACTIVE', api_resource_configs=None, impersonation_scopes=None, labels=None, **kwargs)[source]¶

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook, airflow.providers.common.sql.hooks.sql.DbApiHook

Interact with BigQuery.

This hook uses the Google Cloud connection.

Parameters:

gcp_conn_id – The Airflow connection used for GCP credentials.
use_legacy_sql (bool) – Whether to use the legacy SQL dialect.
location (str | None) – The location of the BigQuery resource.
priority (str) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. The default value is INTERACTIVE.
api_resource_configs (dict | None) – Configuration parameters applied to Google BigQuery jobs.
impersonation_chain – Optional service account to impersonate using short-term credentials.
impersonation_scopes (str | collections.abc.Sequence[str] | None) – Optional list of scopes for impersonated account. Will override scopes from connection.
labels (dict | None) – The BigQuery resource label.
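
A minimal sketch of assembling constructor arguments from the parameters above. The helper names and the connection ID `my_gcp_conn` are hypothetical, and rejecting any priority other than the two documented values is an assumption of this sketch rather than confirmed hook behaviour.

```python
VALID_PRIORITIES = ("INTERACTIVE", "BATCH")  # the two values named in the parameter list


def bigquery_hook_kwargs(gcp_conn_id="my_gcp_conn", location=None,
                         priority="INTERACTIVE", labels=None):
    """Build keyword arguments for BigQueryHook (hypothetical helper)."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"unsupported priority: {priority!r}")
    return {
        "gcp_conn_id": gcp_conn_id,
        "use_legacy_sql": False,  # opt into standard SQL; the documented default is True
        "location": location,
        "priority": priority,
        "labels": labels,
    }


def make_hook(**overrides):
    """Create the hook; the import is deferred so the helper above runs without Airflow."""
    from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

    return BigQueryHook(**bigquery_hook_kwargs(**overrides))
```

Calling `make_hook(location="EU", priority="BATCH")` would then return a hook configured for batch-priority standard SQL queries, provided the connection exists.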

conn_name_attr = 'gcp_conn_id'[source]¶

default_conn_name = 'google_cloud_bigquery_default'[source]¶

conn_type = 'gcpbigquery'[source]¶

hook_name = 'Google Bigquery'[source]¶

classmethod get_connection_form_widgets()[source]¶

Return connection widgets to add to connection form.

classmethod get_ui_field_behaviour()[source]¶

Return custom field behaviour.

use_legacy_sql: bool[source]¶

location: str | None[source]¶

priority: str[source]¶

running_job_id: str | None = None[source]¶

api_resource_configs: dict[source]¶

labels[source]¶

impersonation_scopes: str | collections.abc.Sequence[str] | None = None[source]¶

get_conn()[source]¶

Get a BigQuery PEP 249 connection object.

get_client(project_id=PROVIDE_PROJECT_ID, location=None)[source]¶

Get an authenticated BigQuery Client.

Parameters:

project_id (str) – Project ID for the project which the client acts on behalf of.
location (str | None) – Default location for jobs / datasets / tables.

get_uri()[source]¶

Override from DbApiHook for get_sqlalchemy_engine().

get_sqlalchemy_engine(engine_kwargs=None)[source]¶

Create an SQLAlchemy engine object.

Parameters:

engine_kwargs (dict | None) – Kwargs used in create_engine().

get_records(sql, parameters=None)[source]¶

Execute the sql and return a set of records.

Parameters:

sql – the sql statement to be executed (str) or a list of sql statements to execute
parameters – The parameters to render the SQL query with.

abstract insert_rows(table, rows, target_fields=None, commit_every=1000, replace=False, **kwargs)[source]¶

Insert rows.

Insertion is currently unsupported. Theoretically, you could use BigQuery’s streaming API to insert rows into a table, but this hasn’t been implemented.

get_df(sql, parameters=None, dialect=None, *, df_type: Literal['pandas'] = 'pandas', **kwargs) → pandas.DataFrame[source]¶

get_df(sql, parameters=None, dialect=None, *, df_type: Literal['polars'], **kwargs) → polars.DataFrame

Get a DataFrame for the BigQuery results.

The DbApiHook method must be overridden because Pandas doesn’t support PEP 249 connections, except for SQLite.

Parameters:

sql – The BigQuery SQL to execute.
parameters – The parameters to render the SQL query with (not used; kept only to match the signature of the overridden superclass method).
dialect – Dialect of BigQuery SQL, either legacy SQL or standard SQL. Defaults to self.use_legacy_sql if not specified.
kwargs – (optional) passed into pandas_gbq.read_gbq method

get_pandas_df(sql, parameters=None, dialect=None, **kwargs)[source]¶

Execute the SQL and return a pandas DataFrame.

Parameters:

sql – the sql statement to be executed (str) or a list of sql statements to execute
parameters – The parameters to render the SQL query with.
kwargs – (optional) passed into pandas.io.sql.read_sql method

table_exists(dataset_id, table_id, project_id)[source]¶

Check if a table exists in Google BigQuery.

Parameters:

project_id (str) – The Google Cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
dataset_id (str) – The name of the dataset in which to look for the table.
table_id (str) – The name of the table to check the existence of.

table_partition_exists(dataset_id, table_id, partition_id, project_id)[source]¶

Check if a partition exists in Google BigQuery.

Parameters:

project_id (str) – The Google Cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
dataset_id (str) – The name of the dataset in which to look for the table.
table_id (str) – The name of the table to check the existence of.
partition_id (str) – The name of the partition to check the existence of.

create_table(dataset_id, table_id, table_resource, location=None, project_id=PROVIDE_PROJECT_ID, exists_ok=True, schema_fields=None, retry=DEFAULT_RETRY, timeout=None)[source]¶

Create a new, empty table in the dataset.

Parameters:

project_id (str) – Optional. The project in which to create the table.
dataset_id (str) – Required. The dataset in which to create the table.
table_id (str) – Required. The name of the table to be created.
table_resource (dict[str, Any] | google.cloud.bigquery.table.Table | google.cloud.bigquery.table.TableReference | google.cloud.bigquery.table.TableListItem) – Required. Table resource as described in the documentation (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table). If the table is a reference, an empty table is created with the specified ID. The dataset that the table belongs to must already exist.

schema_fields (list | None) –

Optional. If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema

schema_fields = [
    {"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"},
]

location (str | None) – Optional. The location used for the operation.
exists_ok (bool) – Optional. If True, ignore “already exists” errors when creating the table.
retry (google.api_core.retry.Retry) – Optional. A retry object used to retry requests. If None is specified, requests will not be retried.
timeout (float | None) – Optional. The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
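
A minimal sketch of a create_table call, assuming `hook` is an already constructed BigQueryHook. The helper name is hypothetical; the table resource follows the REST Table layout linked above and reuses the schema fields from the example.

```python
def create_employee_table(hook, project_id, dataset_id, table_id):
    """Create an empty two-column table (hypothetical helper around create_table)."""
    table_resource = {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "schema": {
            "fields": [
                {"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"},
            ]
        },
    }
    # exists_ok=True makes the call safe to repeat if the table is already there.
    return hook.create_table(
        dataset_id=dataset_id,
        table_id=table_id,
        table_resource=table_resource,
        project_id=project_id,
        exists_ok=True,
    )
```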

create_empty_dataset(dataset_id=None, project_id=PROVIDE_PROJECT_ID, location=None, dataset_reference=None, exists_ok=True)[source]¶

Create a new empty dataset.

Parameters:

project_id (str) – The name of the project in which to create the empty dataset. Not required if projectId is set in dataset_reference.
dataset_id (str | None) – The ID of the dataset. Not required if datasetId is set in dataset_reference.
location (str | None) – (Optional) The geographic location where the dataset should reside. There is no default value but the dataset will be created in US if nothing is provided.
dataset_reference (dict[str, Any] | None) – Dataset reference that could be provided with request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
exists_ok (bool) – If True, ignore “already exists” errors when creating the dataset.

get_dataset_tables(dataset_id, project_id=PROVIDE_PROJECT_ID, max_results=None, retry=DEFAULT_RETRY)[source]¶

Get the list of tables for a given dataset.

For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list

Parameters:

dataset_id (str) – the dataset ID of the requested dataset.
project_id (str) – (Optional) the project of the requested dataset. If None, self.project_id will be used.
max_results (int | None) – (Optional) the maximum number of tables to return.
retry (google.api_core.retry.Retry) – How to retry the RPC.

Returns:

List of tables associated with the dataset.

Return type:

list[dict[str, Any]]

delete_dataset(dataset_id, project_id=PROVIDE_PROJECT_ID, delete_contents=False, retry=DEFAULT_RETRY)[source]¶

Delete a BigQuery dataset in your project.

Parameters:

project_id (str) – The name of the project that contains the dataset.
dataset_id (str) – The dataset to be deleted.
delete_contents (bool) – If True, delete all the tables in the dataset. If False and the dataset contains tables, the request will fail.
retry (google.api_core.retry.Retry) – How to retry the RPC.

update_table(table_resource, fields=None, dataset_id=None, table_id=None, project_id=PROVIDE_PROJECT_ID)[source]¶

Change some fields of a table.

Use fields to specify which fields to update. At least one field must be provided. If a field is listed in fields and is None in table, the field value will be deleted.

If table.etag is not None, the update will only succeed if the table on the server has the same ETag. Thus reading a table with get_table, changing its fields, and then passing it to update_table will ensure that the changes will only be saved if no modifications to the table occurred since the read.

Parameters:

project_id (str) – The project that contains the table.
dataset_id (str | None) – The dataset that contains the table.
table_id (str | None) – The name of the table to update.
table_resource (dict[str, Any]) – Table resource as described in the documentation (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table). The resource must contain tableReference; otherwise project_id, dataset_id and table_id must be provided.
fields (list[str] | None) – The fields of table to change, spelled as the Table properties (e.g. “friendly_name”).
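
A minimal sketch of a partial update, assuming `hook` is a BigQueryHook. The helper name is hypothetical. Only the property named in fields is sent for update, and the table is identified through tableReference inside the resource.

```python
def set_table_description(hook, project_id, dataset_id, table_id, description):
    """Update only the description of a table (hypothetical helper around update_table)."""
    table_resource = {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "description": description,
    }
    # Listing "description" in fields limits the update to that single property.
    return hook.update_table(
        table_resource=table_resource,
        fields=["description"],
        project_id=project_id,
    )
```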

insert_all(project_id, dataset_id, table_id, rows, ignore_unknown_values=False, skip_invalid_rows=False, fail_on_error=False)[source]¶

Stream data into BigQuery one record at a time without a load job.

Parameters:

project_id (str) – The name of the project that contains the table.
dataset_id (str) – The name of the dataset that contains the table.
table_id (str) – The name of the table.

rows (list) –

the rows to insert

rows = [{"json": {"a_key": "a_value_0"}}, {"json": {"a_key": "a_value_1"}}]

ignore_unknown_values (bool) – [Optional] Accept rows that contain values that do not match the schema. The unknown values are ignored. The default value is false, which treats unknown values as errors.
skip_invalid_rows (bool) – [Optional] Insert all valid rows of a request, even if invalid rows exist. The default value is false, which causes the entire request to fail if any invalid rows exist.
fail_on_error (bool) – [Optional] Force the task to fail if any errors occur. The default value is false, which indicates the task should not fail even if any insertion errors occur.
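
A minimal sketch of preparing the rows payload, assuming `hook` is a BigQueryHook and `records` is a list of plain dictionaries. The helper name is hypothetical. Each record is wrapped under a "json" key, matching the rows example above.

```python
def stream_records(hook, project_id, dataset_id, table_id, records):
    """Wrap plain dicts in the {"json": ...} envelope and stream them (hypothetical helper)."""
    rows = [{"json": record} for record in records]
    hook.insert_all(
        project_id=project_id,
        dataset_id=dataset_id,
        table_id=table_id,
        rows=rows,
        ignore_unknown_values=False,
        skip_invalid_rows=False,
        fail_on_error=True,  # surface insertion errors instead of ignoring them
    )
    return rows
```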

update_dataset(fields, dataset_resource, dataset_id=None, project_id=PROVIDE_PROJECT_ID, retry=DEFAULT_RETRY)[source]¶

Change some fields of a dataset.

Use fields to specify which fields to update. At least one field must be provided. If a field is listed in fields and is None in dataset, it will be deleted.

If dataset.etag is not None, the update will only succeed if the dataset on the server has the same ETag. Thus reading a dataset with get_dataset, changing its fields, and then passing it to update_dataset will ensure that the changes will only be saved if no modifications to the dataset occurred since the read.

Parameters:

dataset_resource (dict[str, Any]) – Dataset resource that will be provided in the request body. See https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
dataset_id (str | None) – The id of the dataset.
fields (collections.abc.Sequence[str]) – The properties of dataset to change (e.g. “friendly_name”).
project_id (str) – The Google Cloud Project ID
retry (google.api_core.retry.Retry) – How to retry the RPC.

get_datasets_list(project_id=PROVIDE_PROJECT_ID, include_all=False, filter_=None, max_results=None, page_token=None, retry=DEFAULT_RETRY, return_iterator=False)[source]¶

Get all BigQuery datasets in the current project.

For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list

Parameters:

project_id (str) – Google Cloud Project for which to list the datasets.
include_all (bool) – True if results include hidden datasets. Defaults to False.
filter_ (str | None) – An expression for filtering the results by label. For syntax, see https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list#filter.
max_results (int | None) – Maximum number of datasets to return.
page_token (str | None) – Token representing a cursor into the datasets. If not passed, the API will return the first page of datasets. The token marks the beginning of the iterator to be returned, and the value of the page_token can be accessed at next_page_token of the HTTPIterator.
retry (google.api_core.retry.Retry) – How to retry the RPC.
return_iterator (bool) – Instead of returning a list, return an HTTPIterator, which can be used to obtain the next_page_token property.
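
A minimal sketch of token-based paging with return_iterator=True, assuming `hook` is a BigQueryHook. The helper name is hypothetical, and treating max_results as the size of each chunk is an assumption about how the underlying iterator behaves.

```python
def iter_dataset_pages(hook, project_id, page_size=50):
    """Yield lists of datasets, one chunk per call (hypothetical helper)."""
    page_token = None
    while True:
        iterator = hook.get_datasets_list(
            project_id=project_id,
            max_results=page_size,
            page_token=page_token,
            return_iterator=True,
        )
        page = list(iterator)  # consuming the iterator populates next_page_token
        if page:
            yield page
        page_token = iterator.next_page_token
        if not page_token:
            break
```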

get_dataset(dataset_id, project_id=PROVIDE_PROJECT_ID)[source]¶

Fetch the dataset referenced by dataset_id.

Parameters:

dataset_id (str) – The BigQuery Dataset ID
project_id (str) – The Google Cloud Project ID

Returns:

dataset_resource

Return type:

google.cloud.bigquery.dataset.Dataset

See also

For more information, see Dataset Resource content: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource

run_grant_dataset_view_access(source_dataset, view_dataset, view_table, view_project=None, project_id=PROVIDE_PROJECT_ID)[source]¶

Grant authorized view access of a dataset to a view table.

If this view has already been granted access to the dataset, do nothing. This method is not atomic. Running it may clobber a simultaneous update.

Parameters:

source_dataset (str) – the source dataset
view_dataset (str) – the dataset that the view is in
view_table (str) – the table of the view
project_id (str) – the project of the source dataset. If None, self.project_id will be used.
view_project (str | None) – the project that the view is in. If None, self.project_id will be used.

Returns:

The dataset resource of the source dataset.

Return type:

dict[str, Any]

run_table_upsert(dataset_id, table_resource, project_id=PROVIDE_PROJECT_ID)[source]¶

Update a table if it exists, otherwise create a new one.

Since BigQuery does not natively allow table upserts, this is not an atomic operation.

Parameters:

dataset_id (str) – the dataset to upsert the table into.
table_resource (dict[str, Any]) – a table resource. See https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
project_id (str) – the project to upsert the table into. If None, project will be self.project_id.

delete_table(table_id, not_found_ok=True, project_id=PROVIDE_PROJECT_ID)[source]¶

Delete an existing table from the dataset.

If the table does not exist, return an error unless not_found_ok is set to True.

Parameters:

table_id (str) – A dotted table name in the form (<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.
not_found_ok (bool) – if True, then return success even if the requested table does not exist.
project_id (str) – the project used to perform the request

list_rows(dataset_id, table_id, max_results=None, selected_fields=None, page_token=None, start_index=None, project_id=PROVIDE_PROJECT_ID, location=None, retry=DEFAULT_RETRY, return_iterator=False)[source]¶

List rows in a table.

See https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list

Parameters:

dataset_id (str) – the dataset ID of the requested table.
table_id (str) – the table ID of the requested table.
max_results (int | None) – the maximum results to return.
selected_fields (list[str] | str | None) – List of fields to return (comma-separated). If unspecified, all fields are returned.
page_token (str | None) – page token, returned from a previous call, identifying the result set.
start_index (int | None) – zero based index of the starting row to read.
project_id (str) – Project ID for the project which the client acts on behalf of.
location (str | None) – Default location for job.
retry (google.api_core.retry.Retry) – How to retry the RPC.
return_iterator (bool) – Instead of returning a list[Row], returns a RowIterator which can be used to obtain the next_page_token property.

Returns:

list of rows

Return type:

list[google.cloud.bigquery.table.Row] | google.cloud.bigquery.table.RowIterator
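
A minimal sketch of previewing a few columns of a table, assuming `hook` is a BigQueryHook. The helper name is hypothetical. It relies on each returned Row exposing an items() method, and it passes selected_fields as a comma-separated string.

```python
def preview_rows(hook, dataset_id, table_id, fields, limit=10):
    """Return up to `limit` rows as plain dictionaries (hypothetical helper)."""
    rows = hook.list_rows(
        dataset_id=dataset_id,
        table_id=table_id,
        max_results=limit,
        selected_fields=",".join(fields),
    )
    # Convert each row to a dict so callers do not depend on the Row type.
    return [dict(row.items()) for row in rows]
```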

get_schema(dataset_id, table_id, project_id=PROVIDE_PROJECT_ID)[source]¶

Get the schema for a given dataset and table.

Parameters:

dataset_id (str) – the dataset ID of the requested table
table_id (str) – the table ID of the requested table
project_id (str) – the optional project ID of the requested table. If not provided, the connector’s configured project will be used.

Returns:

a table schema

Return type:

dict

update_table_schema(schema_fields_updates, include_policy_tags, dataset_id, table_id, project_id=PROVIDE_PROJECT_ID)[source]¶

Update fields within a schema for a given dataset and table.

Note that some fields in schemas are immutable; trying to change them will cause an exception.

If a new field is included, it will be inserted, which requires all required fields to be set.

Parameters:

include_policy_tags (bool) – If set to True, policy tags are included in the update request. This requires special permissions even if the tags are unchanged; see https://cloud.google.com/bigquery/docs/column-level-security#roles
dataset_id (str) – the dataset ID of the requested table to be updated
table_id (str) – the table ID of the table to be updated

schema_fields_updates (list[dict[str, Any]]) –

a partial schema resource. See https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema

schema_fields_updates = [
    {"name": "emp_name", "description": "Some New Description"},
    {"name": "salary", "description": "Some New Description"},
    {
        "name": "departments",
        "fields": [
            {"name": "name", "description": "Some New Description"},
            {"name": "type", "description": "Some New Description"},
        ],
    },
]

project_id (str) – The name of the project where we want to update the table.
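
A minimal sketch of building schema_fields_updates from a name-to-description mapping, assuming `hook` is a BigQueryHook. The helper name is hypothetical, and policy tags are left out of the request.

```python
def describe_columns(hook, project_id, dataset_id, table_id, descriptions):
    """Set column descriptions from a {column: text} mapping (hypothetical helper)."""
    updates = [
        {"name": name, "description": text}
        for name, text in descriptions.items()
    ]
    hook.update_table_schema(
        schema_fields_updates=updates,
        include_policy_tags=False,  # avoids the extra permissions policy tags require
        dataset_id=dataset_id,
        table_id=table_id,
        project_id=project_id,
    )
    return updates
```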

poll_job_complete(job_id, project_id=PROVIDE_PROJECT_ID, location=None, retry=DEFAULT_RETRY)[source]¶

Check if a job has completed.

Parameters:

job_id (str) – id of the job.
project_id (str) – Google Cloud Project where the job is running
location (str | None) – location the job is running
retry (google.api_core.retry.Retry) – How to retry the RPC.

cancel_job(job_id, project_id=PROVIDE_PROJECT_ID, location=None)[source]¶

Cancel a job and wait for cancellation to complete.

Parameters:

job_id (str) – id of the job.
project_id (str) – Google Cloud Project where the job is running
location (str | None) – location the job is running

get_job(job_id, project_id=PROVIDE_PROJECT_ID, location=None)[source]¶

Retrieve a BigQuery job.

Parameters:

job_id (str) – The ID of the job. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters.
project_id (str) – Google Cloud Project where the job is running.
location (str | None) – Location where the job is running.

insert_job(configuration, job_id=None, project_id=PROVIDE_PROJECT_ID, location=None, nowait=False, retry=DEFAULT_RETRY, timeout=None)[source]¶

Execute a BigQuery job and wait for it to complete.

Parameters:

configuration (dict) – The configuration parameter maps directly to BigQuery’s configuration field in the job object. See https://cloud.google.com/bigquery/docs/reference/v2/jobs for details.
job_id (str | None) – The ID of the job. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters. If not provided then uuid will be generated.
project_id (str) – Google Cloud Project where the job is running.
location (str | None) – Location the job is running.
nowait (bool) – Whether to insert job without waiting for the result.
retry (google.api_core.retry.Retry) – How to retry the RPC.
timeout (float | None) – The number of seconds to wait for the underlying HTTP transport before using retry.

Returns:

The BigQuery job.

Return type:

BigQueryJob
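
A minimal sketch of a query job configuration, assuming `hook` is a BigQueryHook. The helper name is hypothetical. The dictionary follows the query section of the BigQuery job configuration (query, useLegacySql, priority).

```python
def run_query(hook, sql, project_id, location=None, batch=False):
    """Submit a standard SQL query job and wait for it (hypothetical helper)."""
    configuration = {
        "query": {
            "query": sql,
            "useLegacySql": False,
            "priority": "BATCH" if batch else "INTERACTIVE",
        }
    }
    # job_id is omitted, so one is generated as described above.
    return hook.insert_job(
        configuration=configuration,
        project_id=project_id,
        location=location,
        nowait=False,
    )
```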

generate_job_id(job_id, dag_id, task_id, logical_date, configuration, force_rerun=False)[source]¶

split_tablename(table_input, default_project_id, var_name=None)[source]¶

get_query_results(job_id, location, max_results=None, selected_fields=None, project_id=PROVIDE_PROJECT_ID, retry=DEFAULT_RETRY, job_retry=DEFAULT_JOB_RETRY)[source]¶

Get query results given a job_id.

Parameters:

job_id (str) – The ID of the job. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters.
location (str) – The location used for the operation.
selected_fields (list[str] | str | None) – List of fields to return (comma-separated). If unspecified, all fields are returned.
max_results (int | None) – The maximum number of records (rows) to be fetched from the table.
project_id (str) – Google Cloud Project where the job ran.
retry (google.api_core.retry.Retry) – How to retry the RPC.
job_retry (google.api_core.retry.Retry) – How to retry failed jobs.

Returns:

List of rows where columns are filtered by selected fields, when given

Raises:

AirflowException

Return type:

list[dict[str, Any]]
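
A minimal sketch of chaining insert_job and get_query_results, assuming `hook` is a BigQueryHook and that the returned job object exposes a job_id attribute. The helper name is hypothetical.

```python
def query_to_dicts(hook, sql, project_id, location):
    """Run a query and return its rows as dictionaries (hypothetical helper)."""
    job = hook.insert_job(
        configuration={"query": {"query": sql, "useLegacySql": False}},
        project_id=project_id,
        location=location,
    )
    # The results lookup needs the same location the job ran in.
    return hook.get_query_results(
        job_id=job.job_id,
        location=location,
        project_id=project_id,
    )
```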

property scopes: collections.abc.Sequence[str][source]¶

Return OAuth 2.0 scopes.

Returns:

The scope defined in impersonation_scopes, the connection configuration, or the default scope.

Return type:

collections.abc.Sequence[str]

class airflow.providers.google.cloud.hooks.bigquery.BigQueryConnection(*args, **kwargs)[source]¶

BigQuery connection.

BigQuery does not have a notion of a persistent connection. Thus, these objects are small stateless factories for cursors, which do all the real work.

close()[source]¶

Do nothing. Not needed for BigQueryConnection.

commit()[source]¶

Do nothing. BigQueryConnection does not support transactions.

cursor()[source]¶

Return a new Cursor object using the connection.

abstract rollback()[source]¶

Do nothing. BigQueryConnection does not support transactions.

class airflow.providers.google.cloud.hooks.bigquery.BigQueryBaseCursor(service, project_id, hook, use_legacy_sql=True, api_resource_configs=None, location=None, num_retries=5, labels=None)[source]¶

Bases: airflow.utils.log.logging_mixin.LoggingMixin

BigQuery cursor.

The BigQuery base cursor contains helper methods to execute queries against BigQuery. The methods can be used directly by operators, in cases where a PEP 249 cursor isn’t needed.

service[source]¶

project_id[source]¶

use_legacy_sql = True[source]¶

api_resource_configs: dict[source]¶

running_job_id: str | None = None[source]¶

location = None[source]¶

num_retries = 5[source]¶

labels = None[source]¶

hook[source]¶

class airflow.providers.google.cloud.hooks.bigquery.BigQueryCursor(service, project_id, hook, use_legacy_sql=True, location=None, num_retries=5)[source]¶

Bases: BigQueryBaseCursor

A very basic BigQuery PEP 249 cursor implementation.

The PyHive PEP 249 implementation was used as a reference:

https://github.com/dropbox/PyHive/blob/master/pyhive/presto.py
https://github.com/dropbox/PyHive/blob/master/pyhive/common.py

buffersize: int | None = None[source]¶

page_token: str | None = None[source]¶

job_id: str | None = None[source]¶

buffer: list = [][source]¶

all_pages_loaded: bool = False[source]¶

property description: list[source]¶

Return the cursor description.

close()[source]¶

By default, do nothing.

property rowcount: int[source]¶

By default, return -1 to indicate that this is not supported.

execute(operation, parameters=None)[source]¶

Execute a BigQuery query, and update the BigQueryCursor description.

Parameters:

operation (str) – The query to execute.
parameters (dict | None) – Parameters to substitute into the query.

executemany(operation, seq_of_parameters)[source]¶

Execute a BigQuery query multiple times with different parameters.

Parameters:

operation (str) – The query to execute.
seq_of_parameters (list) – List of dictionary parameters to substitute into the query.

flush_results()[source]¶

Flush results related cursor attributes.

fetchone()[source]¶

Fetch the next row of a query result set.

next()[source]¶

Return the next row from a buffer.

Helper method for fetchone.

If the buffer is empty, attempts to paginate through the result set for the next page, and load it into the buffer.

fetchmany(size=None)[source]¶

Fetch the next set of rows of a query result.

This returns a sequence of sequences (e.g. a list of tuples). An empty sequence is returned when no more rows are available.

The number of rows to fetch per call is specified by the parameter. If it is not given, the cursor’s arraysize determines the number of rows to be fetched.

This method tries to fetch as many rows as indicated by the size parameter. If this is not possible due to the specified number of rows not being available, fewer rows may be returned.

An Error (or subclass) exception is raised if the previous call to execute() did not produce any result set, or no call was issued yet.
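
A minimal sketch of batch-wise fetching that uses only PEP 249 cursor methods (execute and fetchmany), so it should work with any compliant cursor. The helper name is hypothetical; for BigQuery the cursor is assumed to come from `hook.get_conn().cursor()`.

```python
def fetch_in_batches(cursor, operation, batch_size=500):
    """Execute a query and yield its rows in batches of at most batch_size."""
    cursor.execute(operation)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty sequence means no more rows are available
            break
        yield batch
```

Because the last batch can be shorter than batch_size, callers should not assume a fixed batch length.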

fetchall()[source]¶

Fetch all (remaining) rows of a query result.

A sequence of sequences (e.g. a list of tuples) is returned.

get_arraysize()[source]¶

Get number of rows to fetch at a time.

See also

fetchmany()

set_arraysize(arraysize)[source]¶

Set the number of rows to fetch at a time.

See also

fetchmany()

arraysize[source]¶

setinputsizes(sizes)[source]¶

Do nothing by default.

setoutputsize(size, column=None)[source]¶

Do nothing by default.

class airflow.providers.google.cloud.hooks.bigquery.BigQueryAsyncHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook

Uses gcloud-aio library to retrieve Job details.

sync_hook_class[source]¶

async get_job_instance(project_id, job_id, session)[source]¶

Get the specified job resource by job ID and project ID.

async get_job_status(job_id, project_id=PROVIDE_PROJECT_ID, location=None)[source]¶

async get_job_output(job_id, project_id=PROVIDE_PROJECT_ID)[source]¶

Get the BigQuery job output for a given job ID asynchronously.

async create_job_for_partition_get(dataset_id, table_id=None, project_id=PROVIDE_PROJECT_ID)[source]¶

Create a new job and get the job_id using gcloud-aio.

async cancel_job(job_id, project_id, location)[source]¶

Cancel a BigQuery job.

Parameters:

job_id (str) – ID of the job to cancel.
project_id (str | None) – Google Cloud Project where the job was running.
location (str | None) – Location where the job was running.
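
A minimal sketch of bounding the cancellation with a timeout, assuming `hook` is a BigQueryAsyncHook. The helper name and the 30-second default are hypothetical.

```python
import asyncio


async def cancel_with_timeout(hook, job_id, project_id, location, timeout=30.0):
    """Cancel a job and report whether the call finished in time (hypothetical helper)."""
    try:
        await asyncio.wait_for(
            hook.cancel_job(job_id=job_id, project_id=project_id, location=location),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        return False
    return True
```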

get_records(query_results, as_dict=False, selected_fields=None)[source]¶

Convert a response from BigQuery to records.

Parameters:

query_results (dict[str, Any]) – the results from a SQL query
as_dict (bool) – if True returns the result as a list of dictionaries, otherwise as list of lists.
selected_fields (str | list[str] | None)
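
A rough illustration of the two output shapes selected by as_dict. This is not the hook's implementation: the function name is hypothetical, the response layout (schema.fields and rows[].f[].v) is assumed from the BigQuery REST query results format, and values are left exactly as returned with no type conversion.

```python
def rows_from_response(query_results, as_dict=False):
    """Flatten a query-results response into lists or dictionaries (illustrative only)."""
    names = [field["name"] for field in query_results["schema"]["fields"]]
    records = []
    for row in query_results.get("rows", []):
        values = [cell["v"] for cell in row["f"]]
        records.append(dict(zip(names, values)) if as_dict else values)
    return records
```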

value_check(sql, pass_value, records=None, tolerance=None)[source]¶

Match a single query resulting row and tolerance with pass_value.

Raises:

AirflowException – if matching fails

interval_check(row1, row2, metrics_thresholds, ignore_zero, ratio_formula)[source]¶

Check values of metrics (SQL expressions) are within a certain tolerance.

Parameters:

row1 (str | None) – first resulting row of a query execution job for first SQL query
row2 (str | None) – first resulting row of a query execution job for second SQL query
metrics_thresholds (dict[str, Any]) – a dictionary of ratios indexed by metric. For example, ‘COUNT(*)’: 1.5 would require a difference of 50 percent or less between the current day and the prior days_back.
ignore_zero (bool) – whether we should ignore zero metrics
ratio_formula (str) – which formula to use to compute the ratio between the two metrics. Assuming cur is the metric of today and ref is the metric of today - days_back:

max_over_min: computes max(cur, ref) / min(cur, ref)
relative_diff: computes abs(cur-ref) / ref
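
The two ratio formulas can be written out as plain Python. This is an illustration of the arithmetic only, not the hook's code; the names are hypothetical, and the strict less-than comparison against the threshold is an assumption.

```python
RATIO_FORMULAS = {
    "max_over_min": lambda cur, ref: max(cur, ref) / min(cur, ref),
    "relative_diff": lambda cur, ref: abs(cur - ref) / ref,
}


def within_threshold(cur, ref, threshold, ratio_formula="max_over_min"):
    """Return True when the chosen ratio of cur to ref stays below threshold."""
    ratio = RATIO_FORMULAS[ratio_formula](cur, ref)
    return ratio < threshold
```

With cur=120 and ref=100, max_over_min gives 120 / 100 and relative_diff gives 20 / 100, so the same pair of metrics needs a different threshold depending on the formula chosen.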

class airflow.providers.google.cloud.hooks.bigquery.BigQueryTableAsyncHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook

Async hook for BigQuery Table.

sync_hook_class[source]¶

async get_table_client(dataset, table_id, project_id, session)[source]¶

Get a Google BigQuery Table object.

Parameters:

dataset (str) – The name of the dataset in which to look for the table storage bucket.
table_id (str) – The name of the table to check the existence of.
project_id (str) – The Google Cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
session (aiohttp.ClientSession) – aiohttp ClientSession