airflow.providers.google.cloud.operators.gcs

This module contains a Google Cloud Storage Bucket operator.

Module Contents

class airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator(*, bucket_name: str, resource: Optional[Dict] = None, storage_class: str = 'MULTI_REGIONAL', location: str = 'US', project_id: Optional[str] = None, labels: Optional[Dict] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use.

See also

For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
  • bucket_name (str) -- The name of the bucket. (templated)

  • resource (dict) -- An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert

  • storage_class (str) --

    This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated). Values include

    • MULTI_REGIONAL

    • REGIONAL

    • STANDARD

    • NEARLINE

    • COLDLINE.

    If this value is not specified when the bucket is created, it will default to STANDARD.

  • location (str) --

    The location of the bucket. (templated) Object data for objects in the bucket resides in physical storage within this region. Defaults to US.

  • project_id (str) -- The ID of the Google Cloud Project. (templated)

  • labels (dict) -- User-provided labels, in key/value pairs.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

The following Operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in EU region

CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    gcp_conn_id='airflow-conn-id'
)
template_fields = ['bucket_name', 'storage_class', 'location', 'project_id', 'impersonation_chain'][source]
ui_color = #f0eee4[source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator(*, bucket: str, prefix: Optional[str] = None, delimiter: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

List all objects from the bucket with the give string prefix and delimiter in name.

This operator returns a python list with the name of objects which can be used by

xcom in the downstream task.

Parameters
  • bucket (str) -- The Google Cloud Storage bucket to find the objects. (templated)

  • prefix (str) -- Prefix string which filters objects whose name begin with this prefix. (templated)

  • delimiter (str) -- The delimiter by which you want to filter the objects. (templated) For e.g to lists the CSV files from in a directory in GCS you would use delimiter='.csv'.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

Example:

The following Operator would list all the Avro files from sales/sales-2017 folder in data bucket.

GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    gcp_conn_id=google_cloud_conn_id
)
template_fields :Iterable[str] = ['bucket', 'prefix', 'delimiter', 'impersonation_chain'][source]
ui_color = #f0eee4[source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator(*, bucket_name: str, objects: Optional[Iterable[str]] = None, prefix: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Deletes objects from a Google Cloud Storage bucket, either from an explicit list of object names or all objects matching a prefix.

Parameters
  • bucket_name (str) -- The GCS bucket to delete from

  • objects (Iterable[str]) -- List of objects to delete. These should be the names of objects in the bucket, not including gs://bucket/

  • prefix -- Prefix of objects to delete. All objects matching this prefix in the bucket will be deleted.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['bucket_name', 'prefix', 'objects', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator(*, bucket: str, entity: str, role: str, user_project: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a new ACL entry on the specified bucket.

See also

For more information on how to use this operator, take a look at the guide: GCSBucketCreateAclEntryOperator

Parameters
  • bucket (str) -- Name of a bucket.

  • entity (str) -- The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers

  • role (str) -- The access permission for the entity. Acceptable values are: "OWNER", "READER", "WRITER".

  • user_project (str) -- (Optional) The project to be billed for this request. Required for Requester Pays buckets.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['bucket', 'entity', 'role', 'user_project', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator(*, bucket: str, object_name: str, entity: str, role: str, generation: Optional[int] = None, user_project: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a new ACL entry on the specified object.

See also

For more information on how to use this operator, take a look at the guide: GCSObjectCreateAclEntryOperator

Parameters
  • bucket (str) -- Name of a bucket.

  • object_name (str) -- Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding

  • entity (str) -- The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers

  • role (str) -- The access permission for the entity. Acceptable values are: "OWNER", "READER".

  • generation (long) -- Optional. If present, selects a specific revision of this object.

  • user_project (str) -- (Optional) The project to be billed for this request. Required for Requester Pays buckets.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['bucket', 'object_name', 'entity', 'generation', 'role', 'user_project', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator(*, source_bucket: str, source_object: str, transform_script: Union[str, List[str]], destination_bucket: Optional[str] = None, destination_object: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Copies data from a source GCS location to a temporary location on the local filesystem. Runs a transformation on this file as specified by the transformation script and uploads the output to a destination bucket. If the output bucket is not specified the original file will be overwritten.

The locations of the source and the destination files in the local filesystem is provided as an first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it and write the output to the local destination file.

Parameters
  • source_bucket (str) -- The key to be retrieved from S3. (templated)

  • destination_bucket (str) -- The key to be written from S3. (templated)

  • transform_script (Union[str, List[str]]) -- location of the executable transformation script or list of arguments passed to subprocess ex. ['python', 'script.py', 10]. (templated)

  • gcp_conn_id (str) -- The connection ID to use connecting to Google Cloud.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['source_bucket', 'destination_bucket', 'transform_script', 'impersonation_chain'][source]
execute(self, context: dict)[source]
class airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator(*, bucket_name: str, force: bool = True, gcp_conn_id: str = 'google_cloud_default', impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Deletes bucket from a Google Cloud Storage.

See also

For more information on how to use this operator, take a look at the guide: Deleting Bucket

Parameters
  • bucket_name (str) -- name of the bucket which will be deleted

  • force -- false not allow to delete non empty bucket, set force=True allows to delete non empty bucket

  • gcp_conn_id (str) -- The connection ID to use connecting to Google Cloud.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

Type

bool

template_fields = ['bucket_name', 'gcp_conn_id', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator(*, source_bucket: str, destination_bucket: str, source_object: Optional[str] = None, destination_object: Optional[str] = None, recursive: bool = True, delete_extra_files: bool = False, allow_overwrite: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Synchronizes the contents of the buckets or bucket's directories in the Google Cloud Services.

Parameters source_object and destination_object describe the root sync directory. If they are not passed, the entire bucket will be synchronized. They should point to directories.

Note

The synchronization of individual files is not supported. Only entire directories can be synchronized.

See also

For more information on how to use this operator, take a look at the guide: GCSSynchronizeBuckets

Parameters
  • source_bucket (str) -- The name of the bucket containing the source objects.

  • destination_bucket (str) -- The name of the bucket containing the destination objects.

  • source_object (Optional[str]) -- The root sync directory in the source bucket.

  • destination_object (Optional[str]) -- The root sync directory in the destination bucket.

  • recursive (bool) -- If True, subdirectories will be considered

  • allow_overwrite (bool) -- if True, the files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed

  • delete_extra_files (bool) --

    if True, deletes additional files from the source that not found in the destination. By default extra files are not deleted.

    Note

    This option can delete data quickly if you specify the wrong source/destination combination.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['source_bucket', 'destination_bucket', 'source_object', 'destination_object', 'recursive', 'delete_extra_files', 'allow_overwrite', 'gcp_conn_id', 'delegate_to', 'impersonation_chain'][source]
execute(self, context)[source]

Was this entry helpful?