airflow.providers.google.cloud.hooks.bigquery

This module contains a BigQuery Hook, as well as a very basic PEP 249 implementation for BigQuery.

Module Contents

airflow.providers.google.cloud.hooks.bigquery.log[source]
airflow.providers.google.cloud.hooks.bigquery.BigQueryJob[source]
class airflow.providers.google.cloud.hooks.bigquery.BigQueryHook(gcp_conn_id: str = default_conn_name, delegate_to: Optional[str] = None, use_legacy_sql: bool = True, location: Optional[str] = None, bigquery_conn_id: Optional[str] = None, api_resource_configs: Optional[Dict] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook, airflow.hooks.dbapi.DbApiHook

Interact with BigQuery. This hook uses the Google Cloud connection.

conn_name_attr = gcp_conn_id[source]
default_conn_name = google_cloud_default[source]
conn_type = google_cloud_platform[source]
hook_name = Google Cloud[source]
get_conn(self)[source]

Returns a BigQuery PEP 249 connection object.

get_service(self)[source]

Returns a BigQuery service object.

get_client(self, project_id: Optional[str] = None, location: Optional[str] = None)[source]

Returns authenticated BigQuery Client.

Parameters
  • project_id (str) – Project ID for the project which the client acts on behalf of.

  • location (str) – Default location for jobs / datasets / tables.

Returns

static _resolve_table_reference(table_resource: Dict[str, Any], project_id: Optional[str] = None, dataset_id: Optional[str] = None, table_id: Optional[str] = None)[source]
insert_rows(self, table: Any, rows: Any, target_fields: Any = None, commit_every: Any = 1000, replace: Any = False, **kwargs)[source]

Insertion is currently unsupported. Theoretically, you could use BigQuery’s streaming API to insert rows into a table, but this hasn’t been implemented.

get_pandas_df(self, sql: str, parameters: Optional[Union[Iterable, Mapping]] = None, dialect: Optional[str] = None, **kwargs)[source]

Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn’t support PEP 249 connections, except for SQLite. See:

https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 https://github.com/pydata/pandas/issues/6900

Parameters
  • sql (str) – The BigQuery SQL to execute.

  • parameters (mapping or iterable) – The parameters to render the SQL query with (not used, leave to override superclass method)

  • dialect (str in {'legacy', 'standard'}) – Dialect of BigQuery SQL – legacy SQL or standard SQL defaults to use self.use_legacy_sql if not specified

  • kwargs (dict) – (optional) passed into pandas_gbq.read_gbq method

table_exists(self, dataset_id: str, table_id: str, project_id: str)[source]

Checks for the existence of a table in Google BigQuery.

Parameters
  • project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.

  • dataset_id (str) – The name of the dataset in which to look for the table.

  • table_id (str) – The name of the table to check the existence of.

table_partition_exists(self, dataset_id: str, table_id: str, partition_id: str, project_id: str)[source]

Checks for the existence of a partition in a table in Google BigQuery.

Parameters
  • project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.

  • dataset_id (str) – The name of the dataset in which to look for the table.

  • table_id (str) – The name of the table to check the existence of.

  • partition_id (str) – The name of the partition to check the existence of.

create_empty_table(self, project_id: Optional[str] = None, dataset_id: Optional[str] = None, table_id: Optional[str] = None, table_resource: Optional[Dict[str, Any]] = None, schema_fields: Optional[List] = None, time_partitioning: Optional[Dict] = None, cluster_fields: Optional[List[str]] = None, labels: Optional[Dict] = None, view: Optional[Dict] = None, materialized_view: Optional[Dict] = None, encryption_configuration: Optional[Dict] = None, retry: Optional[Retry] = DEFAULT_RETRY, num_retries: Optional[int] = None, location: Optional[str] = None, exists_ok: bool = True)[source]

Creates a new, empty table in the dataset. To create a view, which is defined by a SQL query, parse a dictionary to ‘view’ kwarg

Parameters

Example:

schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
Parameters

Example:

view = {
    "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 1000",
    "useLegacySql": False
}
Parameters
  • materialized_view (dict) – [Optional] The materialized view definition.

  • encryption_configuration (dict) –

    [Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

    encryption_configuration = {
        "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
    }
    

  • num_retries (int) – Maximum number of retries in case of connection problems.

  • exists_ok (bool) – If True, ignore “already exists” errors when creating the table.

Returns

Created table

create_empty_dataset(self, dataset_id: Optional[str] = None, project_id: Optional[str] = None, location: Optional[str] = None, dataset_reference: Optional[Dict[str, Any]] = None, exists_ok: bool = True)[source]

Create a new empty dataset: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/insert

Parameters
  • project_id (str) – The name of the project where we want to create an empty a dataset. Don’t need to provide, if projectId in dataset_reference.

  • dataset_id (str) – The id of dataset. Don’t need to provide, if datasetId in dataset_reference.

  • location (str) – (Optional) The geographic location where the dataset should reside. There is no default value but the dataset will be created in US if nothing is provided.

  • dataset_reference (dict) – Dataset reference that could be provided with request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource

  • exists_ok (bool) – If True, ignore “already exists” errors when creating the dataset.

get_dataset_tables(self, dataset_id: str, project_id: Optional[str] = None, max_results: Optional[int] = None, retry: Retry = DEFAULT_RETRY)[source]

Get the list of tables for a given dataset.

For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list

Parameters
  • dataset_id (str) – the dataset ID of the requested dataset.

  • project_id (str) – (Optional) the project of the requested dataset. If None, self.project_id will be used.

  • max_results (int) – (Optional) the maximum number of tables to return.

  • retry (google.api_core.retry.Retry) – How to retry the RPC.

Returns

List of tables associated with the dataset.

delete_dataset(self, dataset_id: str, project_id: Optional[str] = None, delete_contents: bool = False, retry: Retry = DEFAULT_RETRY)[source]

Delete a dataset of Big query in your project.

Parameters
  • project_id (str) – The name of the project where we have the dataset.

  • dataset_id (str) – The dataset to be delete.

  • delete_contents (bool) – If True, delete all the tables in the dataset. If False and the dataset contains tables, the request will fail.

  • retry (google.api_core.retry.Retry) – How to retry the RPC.

create_external_table(self, external_project_dataset_table: str, schema_fields: List, source_uris: List, source_format: str = 'CSV', autodetect: bool = False, compression: str = 'NONE', ignore_unknown_values: bool = False, max_bad_records: int = 0, skip_leading_rows: int = 0, field_delimiter: str = ',', quote_character: Optional[str] = None, allow_quoted_newlines: bool = False, allow_jagged_rows: bool = False, encoding: str = 'UTF-8', src_fmt_configs: Optional[Dict] = None, labels: Optional[Dict] = None, encryption_configuration: Optional[Dict] = None, location: Optional[str] = None, project_id: Optional[str] = None)[source]

Creates a new external table in the dataset with the data from Google Cloud Storage. See here:

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource

for more details about these parameters.

Parameters
  • external_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table>($<partition>) BigQuery table name to create external table. If <project> is not included, project will be the project defined in the connection json.

  • schema_fields (list) – The schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource

  • source_uris (list) – The source Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). A single wild per-object name can be used.

  • source_format (str) – File format to export.

  • autodetect (bool) – Try to detect schema and format options automatically. Any option specified explicitly will be honored.

  • compression (str) – [Optional] The compression type of the data source. Possible values include GZIP and NONE. The default value is NONE. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.

  • ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.

  • max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.

  • skip_leading_rows (int) – Number of rows to skip when loading from a CSV.

  • field_delimiter (str) – The delimiter to use when loading from a CSV.

  • quote_character (str) – The value that is used to quote data sections in a CSV file.

  • allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).

  • allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable when source_format is CSV.

  • encoding (str) –

    The character encoding of the data. See:

  • src_fmt_configs (dict) – configure optional fields specific to the source format

  • labels (dict) – a dictionary containing labels for the table, passed to BigQuery

  • encryption_configuration (dict) –

    [Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

    encryption_configuration = {
        "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
    }
    

update_table(self, table_resource: Dict[str, Any], fields: Optional[List[str]] = None, dataset_id: Optional[str] = None, table_id: Optional[str] = None, project_id: Optional[str] = None)[source]

Change some fields of a table.

Use fields to specify which fields to update. At least one field must be provided. If a field is listed in fields and is None in table, the field value will be deleted.

If table.etag is not None, the update will only succeed if the table on the server has the same ETag. Thus reading a table with get_table, changing its fields, and then passing it to update_table will ensure that the changes will only be saved if no modifications to the table occurred since the read.

Parameters
  • project_id (str) – The project to create the table into.

  • dataset_id (str) – The dataset to create the table into.

  • table_id (str) – The Name of the table to be created.

  • table_resource (Dict[str, Any]) – Table resource as described in documentation: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table The table has to contain tableReference or project_id, dataset_id and table_id have to be provided.

  • fields (List[str]) – The fields of table to change, spelled as the Table properties (e.g. “friendly_name”).

patch_table(self, dataset_id: str, table_id: str, project_id: Optional[str] = None, description: Optional[str] = None, expiration_time: Optional[int] = None, external_data_configuration: Optional[Dict] = None, friendly_name: Optional[str] = None, labels: Optional[Dict] = None, schema: Optional[List] = None, time_partitioning: Optional[Dict] = None, view: Optional[Dict] = None, require_partition_filter: Optional[bool] = None, encryption_configuration: Optional[Dict] = None)[source]

Patch information in an existing table. It only updates fields that are provided in the request object.

Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/patch

Parameters
  • dataset_id (str) – The dataset containing the table to be patched.

  • table_id (str) – The Name of the table to be patched.

  • project_id (str) – The project containing the table to be patched.

  • description (str) – [Optional] A user-friendly description of this table.

  • expiration_time (int) – [Optional] The time when this table expires, in milliseconds since the epoch.

  • external_data_configuration (dict) – [Optional] A dictionary containing properties of a table stored outside of BigQuery.

  • friendly_name (str) – [Optional] A descriptive name for this table.

  • labels (dict) – [Optional] A dictionary containing labels associated with this table.

  • schema (list) –

    [Optional] If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema The supported schema modifications and unsupported schema modification are listed here: https://cloud.google.com/bigquery/docs/managing-table-schemas Example:

    schema=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                   {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
    

  • time_partitioning (dict) – [Optional] A dictionary containing time-based partitioning definition for the table.

  • view (dict) –

    [Optional] A dictionary containing definition for the view. If set, it will patch a view instead of a table: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#ViewDefinition Example:

    view = {
        "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 500",
        "useLegacySql": False
    }
    

  • require_partition_filter (bool) – [Optional] If true, queries over the this table require a partition filter. If false, queries over the table

  • encryption_configuration (dict) –

    [Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

    encryption_configuration = {
        "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
    }
    

insert_all(self, project_id: str, dataset_id: str, table_id: str, rows: List, ignore_unknown_values: bool = False, skip_invalid_rows: bool = False, fail_on_error: bool = False)[source]

Method to stream data into BigQuery one record at a time without needing to run a load job

Parameters
  • project_id (str) – The name of the project where we have the table

  • dataset_id (str) – The name of the dataset where we have the table

  • table_id (str) – The name of the table

  • rows (list) – the rows to insert

Example or rows:

rows=[{“json”: {“a_key”: “a_value_0”}}, {“json”: {“a_key”: “a_value_1”}}]

Parameters
  • ignore_unknown_values (bool) – [Optional] Accept rows that contain values that do not match the schema. The unknown values are ignored. The default value is false, which treats unknown values as errors.

  • skip_invalid_rows (bool) – [Optional] Insert all valid rows of a request, even if invalid rows exist. The default value is false, which causes the entire request to fail if any invalid rows exist.

  • fail_on_error (bool) – [Optional] Force the task to fail if any errors occur. The default value is false, which indicates the task should not fail even if any insertion errors occur.

update_dataset(self, fields: Sequence[str], dataset_resource: Dict[str, Any], dataset_id: Optional[str] = None, project_id: Optional[str] = None, retry: Retry = DEFAULT_RETRY)[source]

Change some fields of a dataset.

Use fields to specify which fields to update. At least one field must be provided. If a field is listed in fields and is None in dataset, it will be deleted.

If dataset.etag is not None, the update will only succeed if the dataset on the server has the same ETag. Thus reading a dataset with get_dataset, changing its fields, and then passing it to update_dataset will ensure that the changes will only be saved if no modifications to the dataset occurred since the read.

Parameters
patch_dataset(self, dataset_id: str, dataset_resource: Dict, project_id: Optional[str] = None)[source]

Patches information in an existing dataset. It only replaces fields that are provided in the submitted dataset resource. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/patch

Parameters
Return type

dataset https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource

get_dataset_tables_list(self, dataset_id: str, project_id: Optional[str] = None, table_prefix: Optional[str] = None, max_results: Optional[int] = None)[source]

Method returns tables list of a BigQuery tables. If table prefix is specified, only tables beginning by it are returned.

For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list

Parameters
  • dataset_id (str) – The BigQuery Dataset ID

  • project_id (str) – The Google Cloud Project ID

  • table_prefix (str) – Tables must begin by this prefix to be returned (case sensitive)

  • max_results (int) – The maximum number of results to return in a single response page. Leverage the page tokens to iterate through the entire collection.

Returns

List of tables associated with the dataset

get_datasets_list(self, project_id: Optional[str] = None, include_all: bool = False, filter_: Optional[str] = None, max_results: Optional[int] = None, page_token: Optional[str] = None, retry: Retry = DEFAULT_RETRY)[source]

Method returns full list of BigQuery datasets in the current project

For more information, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list

Parameters
  • project_id (str) – Google Cloud Project for which you try to get all datasets

  • include_all – True if results include hidden datasets. Defaults to False.

  • filter – An expression for filtering the results by label. For syntax, see https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list#filter.

  • filter – str

  • max_results – Maximum number of datasets to return.

  • max_results – int

  • page_token – Token representing a cursor into the datasets. If not passed, the API will return the first page of datasets. The token marks the beginning of the iterator to be returned and the value of the page_token can be accessed at next_page_token of the HTTPIterator.

  • page_token – str

  • retry (google.api_core.retry.Retry) – How to retry the RPC.

get_dataset(self, dataset_id: str, project_id: Optional[str] = None)[source]

Fetch the dataset referenced by dataset_id.

Parameters
  • dataset_id (str) – The BigQuery Dataset ID

  • project_id (str) – The Google Cloud Project ID

Returns

dataset_resource

See also

For more information, see Dataset Resource content: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource

run_grant_dataset_view_access(self, source_dataset: str, view_dataset: str, view_table: str, source_project: Optional[str] = None, view_project: Optional[str] = None, project_id: Optional[str] = None)[source]

Grant authorized view access of a dataset to a view table. If this view has already been granted access to the dataset, do nothing. This method is not atomic. Running it may clobber a simultaneous update.

Parameters
  • source_dataset (str) – the source dataset

  • view_dataset (str) – the dataset that the view is in

  • view_table (str) – the table of the view

  • project_id (str) – the project of the source dataset. If None, self.project_id will be used.

  • view_project (str) – the project that the view is in. If None, self.project_id will be used.

Returns

the datasets resource of the source dataset.

run_table_upsert(self, dataset_id: str, table_resource: Dict[str, Any], project_id: Optional[str] = None)[source]

If the table already exists, update the existing table if not create new. Since BigQuery does not natively allow table upserts, this is not an atomic operation.

Parameters
Returns

run_table_delete(self, deletion_dataset_table: str, ignore_if_missing: bool = False)[source]

Delete an existing table from the dataset; If the table does not exist, return an error unless ignore_if_missing is set to True.

Parameters
  • deletion_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.

  • ignore_if_missing (bool) – if True, then return success even if the requested table does not exist.

Returns

delete_table(self, table_id: str, not_found_ok: bool = True, project_id: Optional[str] = None)[source]

Delete an existing table from the dataset. If the table does not exist, return an error unless not_found_ok is set to True.

Parameters
  • table_id (str) – A dotted (<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.

  • not_found_ok (bool) – if True, then return success even if the requested table does not exist.

  • project_id (str) – the project used to perform the request

get_tabledata(self, dataset_id: str, table_id: str, max_results: Optional[int] = None, selected_fields: Optional[str] = None, page_token: Optional[str] = None, start_index: Optional[int] = None)[source]

Get the data of a given dataset.table and optionally with selected columns. see https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list

Parameters
  • dataset_id – the dataset ID of the requested table.

  • table_id – the table ID of the requested table.

  • max_results – the maximum results to return.

  • selected_fields – List of fields to return (comma-separated). If unspecified, all fields are returned.

  • page_token – page token, returned from a previous call, identifying the result set.

  • start_index – zero based index of the starting row to read.

Returns

list of rows

list_rows(self, dataset_id: str, table_id: str, max_results: Optional[int] = None, selected_fields: Optional[Union[List[str], str]] = None, page_token: Optional[str] = None, start_index: Optional[int] = None, project_id: Optional[str] = None, location: Optional[str] = None)[source]

List the rows of the table. See https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list

Parameters
  • dataset_id – the dataset ID of the requested table.

  • table_id – the table ID of the requested table.

  • max_results – the maximum results to return.

  • selected_fields – List of fields to return (comma-separated). If unspecified, all fields are returned.

  • page_token – page token, returned from a previous call, identifying the result set.

  • start_index – zero based index of the starting row to read.

  • project_id – Project ID for the project which the client acts on behalf of.

  • location – Default location for job.

Returns

list of rows

get_schema(self, dataset_id: str, table_id: str, project_id: Optional[str] = None)[source]

Get the schema for a given dataset and table. see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource

Parameters
  • dataset_id – the dataset ID of the requested table

  • table_id – the table ID of the requested table

  • project_id – the optional project ID of the requested table. If not provided, the connector’s configured project will be used.

Returns

a table schema

poll_job_complete(self, job_id: str, project_id: Optional[str] = None, location: Optional[str] = None, retry: Retry = DEFAULT_RETRY)[source]

Check if jobs completed.

Parameters
  • job_id (str) – id of the job.

  • project_id (str) – Google Cloud Project where the job is running

  • location (str) – location the job is running

  • retry (google.api_core.retry.Retry) – How to retry the RPC.

Return type

bool

cancel_query(self)[source]

Cancel all started queries that have not yet completed

cancel_job(self, job_id: str, project_id: Optional[str] = None, location: Optional[str] = None)[source]

Cancels a job an wait for cancellation to complete

Parameters
  • job_id (str) – id of the job.

  • project_id (str) – Google Cloud Project where the job is running

  • location (str) – location the job is running

get_job(self, job_id: Optional[str] = None, project_id: Optional[str] = None, location: Optional[str] = None)[source]

Retrieves a BigQuery job. For more information see: https://cloud.google.com/bigquery/docs/reference/v2/jobs

Parameters
  • job_id (str) – The ID of the job. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters. If not provided then uuid will be generated.

  • project_id (str) – Google Cloud Project where the job is running

  • location (str) – location the job is running

static _custom_job_id(configuration: Dict[str, Any])[source]
insert_job(self, configuration: Dict, job_id: Optional[str] = None, project_id: Optional[str] = None, location: Optional[str] = None)[source]

Executes a BigQuery job. Waits for the job to complete and returns job id. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

Parameters
  • configuration (Dict[str, Any]) – The configuration parameter maps directly to BigQuery’s configuration field in the job object. See https://cloud.google.com/bigquery/docs/reference/v2/jobs for details.

  • job_id (str) – The ID of the job. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters. If not provided then uuid will be generated.

  • project_id (str) – Google Cloud Project where the job is running

  • location (str) – location the job is running

run_with_configuration(self, configuration: dict)[source]

Executes a BigQuery SQL query. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about the configuration parameter.

Parameters

configuration – The configuration parameter maps directly to BigQuery’s configuration field in the job object. See https://cloud.google.com/bigquery/docs/reference/v2/jobs for details.

run_load(self, destination_project_dataset_table: str, source_uris: List, schema_fields: Optional[List] = None, source_format: str = 'CSV', create_disposition: str = 'CREATE_IF_NEEDED', skip_leading_rows: int = 0, write_disposition: str = 'WRITE_EMPTY', field_delimiter: str = ',', max_bad_records: int = 0, quote_character: Optional[str] = None, ignore_unknown_values: bool = False, allow_quoted_newlines: bool = False, allow_jagged_rows: bool = False, encoding: str = 'UTF-8', schema_update_options: Optional[Iterable] = None, src_fmt_configs: Optional[Dict] = None, time_partitioning: Optional[Dict] = None, cluster_fields: Optional[List] = None, autodetect: bool = False, encryption_configuration: Optional[Dict] = None)[source]

Executes a BigQuery load command to load data from Google Cloud Storage to BigQuery. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about these parameters.

Parameters
  • destination_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table>($<partition>) BigQuery table to load data into. If <project> is not included, project will be the project defined in the connection json. If a partition is specified the operator will automatically append the data, create a new partition or create a new DAY partitioned table.

  • schema_fields (list) – The schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Required if autodetect=False; optional if autodetect=True.

  • autodetect (bool) – Attempt to autodetect the schema for CSV and JSON source files.

  • source_uris (list) – The source Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). A single wild per-object name can be used.

  • source_format (str) – File format to export.

  • create_disposition (str) – The create disposition if the table doesn’t exist.

  • skip_leading_rows (int) – Number of rows to skip when loading from a CSV.

  • write_disposition (str) – The write disposition if the table already exists.

  • field_delimiter (str) – The delimiter to use when loading from a CSV.

  • max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.

  • quote_character (str) – The value that is used to quote data sections in a CSV file.

  • ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.

  • allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).

  • allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable when source_format is CSV.

  • encoding (str) –

    The character encoding of the data.

  • schema_update_options (Union[list, tuple, set]) – Allows the schema of the destination table to be updated as a side effect of the load job.

  • src_fmt_configs (dict) – configure optional fields specific to the source format

  • time_partitioning (dict) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.

  • cluster_fields (list[str]) – Request that the result of this load be stored sorted by one or more columns. BigQuery supports clustering for both partitioned and non-partitioned tables. The order of columns given determines the sort order.

  • encryption_configuration (dict) –

    [Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

    encryption_configuration = {
        "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
    }
    

run_copy(self, source_project_dataset_tables: Union[List, str], destination_project_dataset_table: str, write_disposition: str = 'WRITE_EMPTY', create_disposition: str = 'CREATE_IF_NEEDED', labels: Optional[Dict] = None, encryption_configuration: Optional[Dict] = None)[source]

Executes a BigQuery copy command to copy data from one BigQuery table to another. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy

For more details about these parameters.

Parameters
  • source_project_dataset_tables (list|string) – One or more dotted (project:|project.)<dataset>.<table> BigQuery tables to use as the source data. Use a list if there are multiple source tables. If <project> is not included, project will be the project defined in the connection json.

  • destination_project_dataset_table (str) – The destination BigQuery table. Format is: (project:|project.)<dataset>.<table>

  • write_disposition (str) – The write disposition if the table already exists.

  • create_disposition (str) – The create disposition if the table doesn’t exist.

  • labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery

  • encryption_configuration (dict) –

    [Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

    encryption_configuration = {
        "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
    }
    

run_extract(self, source_project_dataset_table: str, destination_cloud_storage_uris: str, compression: str = 'NONE', export_format: str = 'CSV', field_delimiter: str = ',', print_header: bool = True, labels: Optional[Dict] = None)[source]

Executes a BigQuery extract command to copy data from BigQuery to Google Cloud Storage. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about these parameters.

Parameters
  • source_project_dataset_table (str) – The dotted <dataset>.<table> BigQuery table to use as the source data.

  • destination_cloud_storage_uris (list) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). Follows convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple

  • compression (str) – Type of compression to use.

  • export_format (str) – File format to export.

  • field_delimiter (str) – The delimiter to use when extracting to a CSV.

  • print_header (bool) – Whether to print a header for a CSV file extract.

  • labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery

run_query(self, sql: str, destination_dataset_table: Optional[str] = None, write_disposition: str = 'WRITE_EMPTY', allow_large_results: bool = False, flatten_results: Optional[bool] = None, udf_config: Optional[List] = None, use_legacy_sql: Optional[bool] = None, maximum_billing_tier: Optional[int] = None, maximum_bytes_billed: Optional[float] = None, create_disposition: str = 'CREATE_IF_NEEDED', query_params: Optional[List] = None, labels: Optional[Dict] = None, schema_update_options: Optional[Iterable] = None, priority: str = 'INTERACTIVE', time_partitioning: Optional[Dict] = None, api_resource_configs: Optional[Dict] = None, cluster_fields: Optional[List[str]] = None, location: Optional[str] = None, encryption_configuration: Optional[Dict] = None)[source]

Executes a BigQuery SQL query. Optionally persists results in a BigQuery table. See here:

https://cloud.google.com/bigquery/docs/reference/v2/jobs

For more details about these parameters.

Parameters
  • sql (str) – The BigQuery SQL to execute.

  • destination_dataset_table (str) – The dotted <dataset>.<table> BigQuery table to save the query results.

  • write_disposition (str) – What to do if the table already exists in BigQuery.

  • allow_large_results (bool) – Whether to allow large results.

  • flatten_results (bool) – If true and query uses legacy SQL dialect, flattens all nested and repeated fields in the query results. allowLargeResults must be true if this is set to false. For standard SQL queries, this flag is ignored and results are never flattened.

  • udf_config (list) – The User Defined Function configuration for the query. See https://cloud.google.com/bigquery/user-defined-functions for details.

  • use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false). If None, defaults to self.use_legacy_sql.

  • api_resource_configs (dict) – a dictionary that contain params ‘configuration’ applied for Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs for example, {‘query’: {‘useQueryCache’: False}}. You could use it if you need to provide some params that are not supported by the BigQueryHook like args.

  • maximum_billing_tier (int) – Positive integer that serves as a multiplier of the basic price.

  • maximum_bytes_billed (float) – Limits the bytes billed for this job. Queries that will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified, this will be set to your project default.

  • create_disposition (str) – Specifies whether the job is allowed to create new tables.

  • query_params (list) – a list of dictionary containing query parameter types and values, passed to BigQuery

  • labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery

  • schema_update_options (Union[list, tuple, set]) – Allows the schema of the destination table to be updated as a side effect of the query job.

  • priority (str) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. The default value is INTERACTIVE.

  • time_partitioning (dict) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.

  • cluster_fields (list[str]) – Request that the result of this query be stored sorted by one or more columns. BigQuery supports clustering for both partitioned and non-partitioned tables. The order of columns given determines the sort order.

  • location (str) – The geographic location of the job. Required except for US and EU. See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location

  • encryption_configuration (dict) –

    [Optional] Custom encryption configuration (e.g., Cloud KMS keys). Example:

    encryption_configuration = {
        "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key"
    }
    

class airflow.providers.google.cloud.hooks.bigquery.BigQueryPandasConnector(project_id: str, service: str, reauth: bool = False, verbose: bool = False, dialect='legacy')[source]

Bases: pandas_gbq.gbq.GbqConnector

This connector behaves identically to GbqConnector (from Pandas), except that it allows the service to be injected, and disables a call to self.get_credentials(). This allows Airflow to use BigQuery with Pandas without forcing a three legged OAuth connection. Instead, we can inject service account credentials into the binding.

class airflow.providers.google.cloud.hooks.bigquery.BigQueryConnection(*args, **kwargs)[source]

BigQuery does not have a notion of a persistent connection. Thus, these objects are small stateless factories for cursors, which do all the real work.

close(self)[source]

BigQueryConnection does not have anything to close

commit(self)[source]

BigQueryConnection does not support transactions

cursor(self)[source]

Return a new Cursor object using the connection

rollback(self)[source]

BigQueryConnection does not have transactions

class airflow.providers.google.cloud.hooks.bigquery.BigQueryBaseCursor(service: Any, project_id: str, hook: BigQueryHook, use_legacy_sql: bool = True, api_resource_configs: Optional[Dict] = None, location: Optional[str] = None, num_retries: int = 5)[source]

Bases: airflow.utils.log.logging_mixin.LoggingMixin

The BigQuery base cursor contains helper methods to execute queries against BigQuery. The methods can be used directly by operators, in cases where a PEP 249 cursor isn’t needed.

create_empty_table(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.create_empty_table

create_empty_dataset(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.create_empty_dataset

get_dataset_tables(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.get_dataset_tables

delete_dataset(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.delete_dataset

create_external_table(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.create_external_table

patch_table(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.patch_table

insert_all(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.insert_all

update_dataset(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.update_dataset

patch_dataset(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.patch_dataset

get_dataset_tables_list(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.get_dataset_tables_list

get_datasets_list(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.get_datasets_list

get_dataset(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.get_dataset

run_grant_dataset_view_access(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_grant_dataset_view_access

run_table_upsert(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_table_upsert

run_table_delete(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_table_delete

get_tabledata(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.get_tabledata

get_schema(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.get_schema

poll_job_complete(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.poll_job_complete

cancel_query(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.cancel_query

run_with_configuration(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_with_configuration

run_load(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_load

run_copy(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_copy

run_extract(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_extract

run_query(self, *args, **kwargs)[source]

This method is deprecated. Please use airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.run_query

class airflow.providers.google.cloud.hooks.bigquery.BigQueryCursor(service: Any, project_id: str, hook: BigQueryHook, use_legacy_sql: bool = True, location: Optional[str] = None, num_retries: int = 5)[source]

Bases: airflow.providers.google.cloud.hooks.bigquery.BigQueryBaseCursor

A very basic BigQuery PEP 249 cursor implementation. The PyHive PEP 249 implementation was used as a reference:

https://github.com/dropbox/PyHive/blob/master/pyhive/presto.py https://github.com/dropbox/PyHive/blob/master/pyhive/common.py

description[source]

The schema description method is not currently implemented

rowcount[source]

By default, return -1 to indicate that this is not supported

arraysize[source]
close(self)[source]

By default, do nothing

execute(self, operation: str, parameters: Optional[dict] = None)[source]

Executes a BigQuery query, and returns the job ID.

Parameters
  • operation (str) – The query to execute.

  • parameters (dict) – Parameters to substitute into the query.

executemany(self, operation: str, seq_of_parameters: list)[source]

Execute a BigQuery query multiple times with different parameters.

Parameters
  • operation (str) – The query to execute.

  • seq_of_parameters (list) – List of dictionary parameters to substitute into the query.

flush_results(self)[source]

Flush results related cursor attributes

fetchone(self)[source]

Fetch the next row of a query result set

next(self)[source]

Helper method for fetchone, which returns the next row from a buffer. If the buffer is empty, attempts to paginate through the result set for the next page, and load it into the buffer.

fetchmany(self, size: Optional[int] = None)[source]

Fetch the next set of rows of a query result, returning a sequence of sequences (e.g. a list of tuples). An empty sequence is returned when no more rows are available. The number of rows to fetch per call is specified by the parameter. If it is not given, the cursor’s arraysize determines the number of rows to be fetched. The method should try to fetch as many rows as indicated by the size parameter. If this is not possible due to the specified number of rows not being available, fewer rows may be returned. An Error (or subclass) exception is raised if the previous call to execute() did not produce any result set or no call was issued yet.

fetchall(self)[source]

Fetch all (remaining) rows of a query result, returning them as a sequence of sequences (e.g. a list of tuples).

get_arraysize(self)[source]

Specifies the number of rows to fetch at a time with .fetchmany()

set_arraysize(self, arraysize: int)[source]

Specifies the number of rows to fetch at a time with .fetchmany()

setinputsizes(self, sizes: Any)[source]

Does nothing by default

setoutputsize(self, size: Any, column: Any = None)[source]

Does nothing by default

airflow.providers.google.cloud.hooks.bigquery._bind_parameters(operation: str, parameters: dict)str[source]
Helper method that binds parameters to a SQL query
airflow.providers.google.cloud.hooks.bigquery._escape(s: str)str[source]
Helper method that escapes parameters to a SQL query
airflow.providers.google.cloud.hooks.bigquery._bq_cast(string_field: str, bq_type: str) → Union[None, int, float, bool, str][source]
Helper method that casts a BigQuery row to the appropriate data types.
This is useful because BigQuery returns all fields as strings.
airflow.providers.google.cloud.hooks.bigquery._split_tablename(table_input: str, default_project_id: str, var_name: Optional[str] = None) → Tuple[str, str, str][source]
airflow.providers.google.cloud.hooks.bigquery._cleanse_time_partitioning(destination_dataset_table: Optional[str], time_partitioning_in: Optional[Dict]) → Dict[source]
airflow.providers.google.cloud.hooks.bigquery._validate_value(key: Any, value: Any, expected_type: Type)None[source]
Function to check expected type and raise error if type is not correct
airflow.providers.google.cloud.hooks.bigquery._api_resource_configs_duplication_check(key: Any, value: Any, config_dict: dict, config_dict_name='api_resource_configs')None[source]
airflow.providers.google.cloud.hooks.bigquery._validate_src_fmt_configs(source_format: str, src_fmt_configs: dict, valid_configs: List[str], backward_compatibility_configs: Optional[Dict] = None) → Dict[source]
Validates the given src_fmt_configs against a valid configuration for the source format.
Adds the backward compatibility config to the src_fmt_configs.
Parameters
  • source_format (str) – File format to export.

  • src_fmt_configs (dict) – Configure optional fields specific to the source format.

  • valid_configs (List[str]) – Valid configuration specific to the source format

  • backward_compatibility_configs (dict) – The top-level params for backward-compatibility

Was this entry helpful?