airflow.providers.google.cloud.operators.gcs

This module contains a Google Cloud Storage Bucket operator.

Module Contents

Classes

GCSCreateBucketOperator

Creates a new bucket.

GCSListObjectsOperator

List all objects from the bucket filtered by given string prefix and delimiter in name or match_glob.

GCSDeleteObjectsOperator

Deletes objects from a list or all objects matching a prefix from a Google Cloud Storage bucket.

GCSBucketCreateAclEntryOperator

Creates a new ACL entry on the specified bucket.

GCSObjectCreateAclEntryOperator

Creates a new ACL entry on the specified object.

GCSFileTransformOperator

Copies data from a source GCS location to a temporary location on the local filesystem.

GCSTimeSpanFileTransformOperator

Copy objects that were modified during a time span, run a transform, and upload results to a bucket.

GCSDeleteBucketOperator

Deletes bucket from a Google Cloud Storage.

GCSSynchronizeBucketsOperator

Synchronizes the contents of the buckets or bucket's directories in the Google Cloud Services.

class airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator(*, bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Creates a new bucket.

Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.

See also

For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
  • bucket_name (str) – The name of the bucket. (templated)

  • resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert

  • storage_class (str) –

    This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated). Values include

    • MULTI_REGIONAL

    • REGIONAL

    • STANDARD

    • NEARLINE

    • COLDLINE.

    If this value is not specified when the bucket is created, it will default to STANDARD.

  • location (str) –

    The location of the bucket. (templated) Object data for objects in the bucket resides in physical storage within this region. Defaults to US.

  • project_id (str | None) – The ID of the Google Cloud Project. (templated)

  • labels (dict | None) – User-provided labels, in key/value pairs.

  • gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

The following Operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in EU region

CreateBucket = GCSCreateBucketOperator(
    task_id="CreateNewBucket",
    bucket_name="test-bucket",
    storage_class="MULTI_REGIONAL",
    location="EU",
    labels={"env": "dev", "team": "airflow"},
    gcp_conn_id="airflow-conn-id",
)
template_fields: Sequence[str] = ('bucket_name', 'storage_class', 'location', 'project_id', 'impersonation_chain')[source]
ui_color = '#f0eee4'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator(*, bucket, prefix=None, delimiter=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, match_glob=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

List all objects from the bucket filtered by given string prefix and delimiter in name or match_glob.

This operator returns a python list with the name of objects which can be used by XCom in the downstream task.

Parameters
  • bucket (str) – The Google Cloud Storage bucket to find the objects. (templated)

  • prefix (str | list[str] | None) – String or list of strings, which filter objects whose name begins with it/them. (templated)

  • delimiter (str | None) – (Deprecated) The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files from in a directory in GCS you would use delimiter=’.csv’.

  • gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g, '**/*/.json')

Example:

The following Operator would list all the Avro files from sales/sales-2017 folder in data bucket.

GCS_Files = GCSListOperator(
    task_id="GCS_Files",
    bucket="data",
    prefix="sales/sales-2017/",
    match_glob="**/*/.avro",
    gcp_conn_id=google_cloud_conn_id,
)
template_fields: Sequence[str] = ('bucket', 'prefix', 'delimiter', 'impersonation_chain')[source]
ui_color = '#f0eee4'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator(*, bucket_name, objects=None, prefix=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Deletes objects from a list or all objects matching a prefix from a Google Cloud Storage bucket.

Parameters
  • bucket_name (str) – The GCS bucket to delete from

  • objects (list[str] | None) – List of objects to delete. These should be the names of objects in the bucket, not including gs://bucket/

  • prefix (str | None) – String or list of strings, which filter objects whose name begin with it/them. (templated)

  • gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields: Sequence[str] = ('bucket_name', 'prefix', 'objects', 'impersonation_chain')[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

get_openlineage_facets_on_complete(task_instance)[source]

Implementing on_complete as execute() resolves object names.

class airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator(*, bucket, entity, role, user_project=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Creates a new ACL entry on the specified bucket.

See also

For more information on how to use this operator, take a look at the guide: GCSBucketCreateAclEntryOperator

Parameters
  • bucket (str) – Name of a bucket.

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”.

  • user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.

  • gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields: Sequence[str] = ('bucket', 'entity', 'role', 'user_project', 'impersonation_chain')[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator(*, bucket, object_name, entity, role, generation=None, user_project=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Creates a new ACL entry on the specified object.

See also

For more information on how to use this operator, take a look at the guide: GCSObjectCreateAclEntryOperator

Parameters
  • bucket (str) – Name of a bucket.

  • object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”.

  • generation (int | None) – Optional. If present, selects a specific revision of this object.

  • user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.

  • gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields: Sequence[str] = ('bucket', 'object_name', 'entity', 'generation', 'role', 'user_project', 'impersonation_chain')[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator(*, source_bucket, source_object, transform_script, destination_bucket=None, destination_object=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Copies data from a source GCS location to a temporary location on the local filesystem.

Runs a transformation on this file as specified by the transformation script and uploads the output to a destination bucket. If the output bucket is not specified the original file will be overwritten.

The locations of the source and the destination files in the local filesystem is provided as an first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it and write the output to the local destination file.

Parameters
  • source_bucket (str) – The bucket to locate the source_object. (templated)

  • source_object (str) – The key to be retrieved from GCS. (templated)

  • destination_bucket (str | None) – The bucket to upload the key after transformation. If not provided, source_bucket will be used. (templated)

  • destination_object (str | None) – The key to be written in GCS. If not provided, source_object will be used. (templated)

  • transform_script (str | list[str]) – location of the executable transformation script or list of arguments passed to subprocess ex. [‘python’, ‘script.py’, 10]. (templated)

  • gcp_conn_id (str) – The connection ID to use connecting to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields: Sequence[str] = ('source_bucket', 'source_object', 'destination_bucket', 'destination_object',...[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

get_openlineage_facets_on_start()[source]
class airflow.providers.google.cloud.operators.gcs.GCSTimeSpanFileTransformOperator(*, source_bucket, source_prefix, source_gcp_conn_id, destination_bucket, destination_prefix, destination_gcp_conn_id, transform_script, source_impersonation_chain=None, destination_impersonation_chain=None, chunk_size=None, download_continue_on_fail=False, download_num_attempts=1, upload_continue_on_fail=False, upload_num_attempts=1, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Copy objects that were modified during a time span, run a transform, and upload results to a bucket.

Determines a list of objects that were added or modified at a GCS source location during a specific time-span, copies them to a temporary location on the local file system, runs a transform on this file as specified by the transformation script and uploads the output to the destination bucket.

See also

For more information on how to use this operator, take a look at the guide: GCSTimeSpanFileTransformOperator

The locations of the source and the destination files in the local filesystem is provided as an first and second arguments to the transformation script. The time-span is passed to the transform script as third and fourth argument as UTC ISO 8601 string.

The transformation script is expected to read the data from source, transform it and write the output to the local destination file.

Parameters
  • source_bucket (str) – The bucket to fetch data from. (templated)

  • source_prefix (str) – Prefix string which filters objects whose name begin with this prefix. Can interpolate execution date and time components. (templated)

  • source_gcp_conn_id (str) – The connection ID to use connecting to Google Cloud to download files to be processed.

  • source_impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials (to download files to be processed), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • destination_bucket (str) – The bucket to write data to. (templated)

  • destination_prefix (str) – Prefix string for the upload location. Can interpolate execution date and time components. (templated)

  • destination_gcp_conn_id (str) – The connection ID to use connecting to Google Cloud to upload processed files.

  • destination_impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials (to upload processed files), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • transform_script (str | list[str]) – location of the executable transformation script or list of arguments passed to subprocess ex. [‘python’, ‘script.py’, 10]. (templated)

  • chunk_size (int | None) – The size of a chunk of data when downloading or uploading (in bytes). This must be a multiple of 256 KB (per the google clout storage API specification).

  • download_continue_on_fail (bool | None) – With this set to true, if a download fails the task does not error out but will still continue.

  • upload_chunk_size – The size of a chunk of data when uploading (in bytes). This must be a multiple of 256 KB (per the google clout storage API specification).

  • upload_continue_on_fail (bool | None) – With this set to true, if an upload fails the task does not error out but will still continue.

  • upload_num_attempts (int) – Number of attempts to try to upload a single file.

template_fields: Sequence[str] = ('source_bucket', 'source_prefix', 'destination_bucket', 'destination_prefix',...[source]
static interpolate_prefix(prefix, dt)[source]

Interpolate prefix with datetime.

Parameters
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

get_openlineage_facets_on_complete(task_instance)[source]

Implementing on_complete as execute() resolves object names.

class airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator(*, bucket_name, force=True, gcp_conn_id='google_cloud_default', impersonation_chain=None, user_project=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Deletes bucket from a Google Cloud Storage.

See also

For more information on how to use this operator, take a look at the guide: Deleting Bucket

Parameters
  • bucket_name (str) – name of the bucket which will be deleted

  • force (bool) – false not allow to delete non empty bucket, set force=True allows to delete non empty bucket

  • gcp_conn_id (str) – The connection ID to use connecting to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • user_project (str | None) – (Optional) The identifier of the project to bill for this request. Required for Requester Pays buckets.

template_fields: Sequence[str] = ('bucket_name', 'gcp_conn_id', 'impersonation_chain', 'user_project')[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator(*, source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, delete_extra_files=False, allow_overwrite=False, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Synchronizes the contents of the buckets or bucket’s directories in the Google Cloud Services.

Parameters source_object and destination_object describe the root sync directory. If they are not passed, the entire bucket will be synchronized. They should point to directories.

Note

The synchronization of individual files is not supported. Only entire directories can be synchronized.

See also

For more information on how to use this operator, take a look at the guide: GCSSynchronizeBucketsOperator

Parameters
  • source_bucket (str) – The name of the bucket containing the source objects.

  • destination_bucket (str) – The name of the bucket containing the destination objects.

  • source_object (str | None) – The root sync directory in the source bucket.

  • destination_object (str | None) – The root sync directory in the destination bucket.

  • recursive (bool) – If True, subdirectories will be considered

  • allow_overwrite (bool) – if True, the files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed

  • delete_extra_files (bool) –

    if True, deletes additional files from the source that not found in the destination. By default extra files are not deleted.

    Note

    This option can delete data quickly if you specify the wrong source/destination combination.

  • gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.

  • impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields: Sequence[str] = ('source_bucket', 'destination_bucket', 'source_object', 'destination_object', 'recursive',...[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

Was this entry helpful?