airflow.providers.google.cloud.operators.gcs

This module contains Google Cloud Storage operators.

Module Contents

class airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator(*, bucket_name: str, resource: Optional[Dict] = None, storage_class: str = 'MULTI_REGIONAL', location: str = 'US', project_id: Optional[str] = None, labels: Optional[Dict] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use.

See also

For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
  • bucket_name (str) -- The name of the bucket. (templated)

  • resource (dict) -- An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert

  • storage_class (str) --

    This defines how objects in the bucket are stored and determines the SLA and the cost of storage. (templated) Values include:

    • MULTI_REGIONAL

    • REGIONAL

    • STANDARD

    • NEARLINE

    • COLDLINE

    If no storage class is specified when a bucket is created, Cloud Storage defaults to STANDARD. This operator passes MULTI_REGIONAL unless you override it, as shown in the signature above.

  • location (str) --

    The location of the bucket. (templated) Object data for objects in the bucket resides in physical storage within this region. Defaults to US.

  • project_id (str) -- The ID of the Google Cloud Project. (templated)

  • labels (dict) -- User-provided labels, in key/value pairs.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
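
The grant rule for impersonation_chain can be sketched in plain Python. The helper name required_token_creator_grants and the account names in the usage note are hypothetical and not part of the provider; the function only restates the rule described above as (grantor, grantee) pairs.

```python
from typing import List, Sequence, Tuple, Union


def required_token_creator_grants(
    originating_account: str,
    impersonation_chain: Union[str, Sequence[str]],
) -> List[Tuple[str, str]]:
    """Return (grantor, grantee) pairs.

    Each grantor must give the grantee the Service Account Token Creator
    IAM role. This is an illustration only; it makes no API calls.
    """
    # A plain string is treated as a chain containing a single account.
    if isinstance(impersonation_chain, str):
        chain = [impersonation_chain]
    else:
        chain = list(impersonation_chain)

    grants = []
    grantee = originating_account
    for account in chain:
        # Every account grants the role to the identity directly before it;
        # the first account grants it to the originating account.
        grants.append((account, grantee))
        grantee = account
    return grants
```

For example, required_token_creator_grants("origin", ["a", "b"]) yields [("a", "origin"), ("b", "a")]: account a grants the role to the originating account, and account b grants it to a.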

The following operator creates a new bucket named test-bucket with the MULTI_REGIONAL storage class in the EU location:

from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator

CreateBucket = GCSCreateBucketOperator(
    task_id="CreateNewBucket",
    bucket_name="test-bucket",
    storage_class="MULTI_REGIONAL",
    location="EU",
    labels={"env": "dev", "team": "airflow"},
    gcp_conn_id="airflow-conn-id",
)
template_fields = ['bucket_name', 'storage_class', 'location', 'project_id', 'impersonation_chain'][source]
ui_color = #f0eee4[source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator(*, bucket: str, prefix: Optional[str] = None, delimiter: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Lists all objects in the bucket whose names match the given prefix and delimiter.

This operator returns a Python list of object names, which downstream tasks can retrieve through XCom.

Parameters
  • bucket (str) -- The Google Cloud Storage bucket to find the objects. (templated)

  • prefix (str) -- Prefix string which filters objects whose names begin with this prefix. (templated)

  • delimiter (str) -- The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
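
The way prefix and delimiter narrow the result can be approximated locally. This is a simplified model of the filtering described above, not the Cloud Storage API call: it assumes the delimiter acts as a name suffix, as in the '.csv' example, and it does not reproduce how the service groups names that contain the delimiter in the middle. The function name filter_object_names is hypothetical.

```python
from typing import Iterable, List, Optional


def filter_object_names(
    names: Iterable[str],
    prefix: Optional[str] = None,
    delimiter: Optional[str] = None,
) -> List[str]:
    """Keep names that start with `prefix` and end with `delimiter`."""
    selected = []
    for name in names:
        if prefix and not name.startswith(prefix):
            continue  # outside the requested "folder"
        if delimiter and not name.endswith(delimiter):
            continue  # does not end with the requested suffix
        selected.append(name)
    return selected


# Hypothetical bucket listing used only for illustration.
objects = [
    "sales/sales-2017/january.avro",
    "sales/sales-2017/notes.txt",
    "sales/sales-2018/january.avro",
]
avro_2017 = filter_object_names(objects, prefix="sales/sales-2017/", delimiter=".avro")
```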

Example:

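The following operator lists all the Avro files in the sales/sales-2017 folder of the data bucket.

from airflow.providers.google.cloud.operators.gcs import GCSListObjectsOperator

# google_cloud_conn_id is assumed to be defined earlier in the DAG file.
GCS_Files = GCSListObjectsOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    gcp_conn_id=google_cloud_conn_id,
)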
template_fields :Iterable[str] = ['bucket', 'prefix', 'delimiter', 'impersonation_chain'][source]
ui_color = #f0eee4[source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator(*, bucket_name: str, objects: Optional[Iterable[str]] = None, prefix: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Deletes objects from a Google Cloud Storage bucket, either from an explicit list of object names or all objects matching a prefix.

Parameters
  • bucket_name (str) -- The GCS bucket to delete from

  • objects (Iterable[str]) -- List of objects to delete. These should be the names of objects in the bucket, not including gs://bucket/

  • prefix (str) -- Prefix of objects to delete. All objects matching this prefix in the bucket will be deleted.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
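
Because objects expects bare object names rather than gs:// URIs, a small pre-processing step may be useful when the names come from elsewhere. The helper below is a hypothetical sketch, not part of the provider; it strips a leading gs://bucket/ and rejects URIs that point at a different bucket.

```python
def to_object_name(uri_or_name: str, bucket_name: str) -> str:
    """Return the object name without the gs://<bucket_name>/ prefix."""
    bucket_prefix = f"gs://{bucket_name}/"
    if uri_or_name.startswith(bucket_prefix):
        return uri_or_name[len(bucket_prefix):]
    if uri_or_name.startswith("gs://"):
        # A URI for some other bucket cannot be deleted through this bucket.
        raise ValueError(f"{uri_or_name!r} is not in bucket {bucket_name!r}")
    # Already a bare object name.
    return uri_or_name


# Hypothetical inputs used only for illustration.
names = [to_object_name(u, "data") for u in ["gs://data/a/b.csv", "c.csv"]]
```

The resulting list could then be passed as the objects argument.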

template_fields = ['bucket_name', 'prefix', 'objects', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator(*, bucket: str, entity: str, role: str, user_project: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a new ACL entry on the specified bucket.

See also

For more information on how to use this operator, take a look at the guide: GCSBucketCreateAclEntryOperator

Parameters
  • bucket (str) -- Name of a bucket.

  • entity (str) -- The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers

  • role (str) -- The access permission for the entity. Acceptable values are: "OWNER", "READER", "WRITER".

  • user_project (str) -- (Optional) The project to be billed for this request. Required for Requester Pays buckets.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
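
A local pre-check of the entity and role values listed above might look like the following. It is a sketch under assumptions: the function name is hypothetical, the check is purely syntactic, and it does not verify that the user, group, domain or project actually exists.

```python
BUCKET_ACL_ROLES = {"OWNER", "READER", "WRITER"}
ENTITY_PREFIXES = ("user-", "group-", "domain-", "project-team-")
SPECIAL_ENTITIES = {"allUsers", "allAuthenticatedUsers"}


def is_valid_bucket_acl_entry(entity: str, role: str) -> bool:
    """Check that entity and role have one of the documented forms."""
    if role not in BUCKET_ACL_ROLES:
        return False
    if entity in SPECIAL_ENTITIES:
        return True
    # A prefixed entity needs a non-empty identifier after the prefix.
    return any(
        entity.startswith(prefix) and len(entity) > len(prefix)
        for prefix in ENTITY_PREFIXES
    )
```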

template_fields = ['bucket', 'entity', 'role', 'user_project', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator(*, bucket: str, object_name: str, entity: str, role: str, generation: Optional[int] = None, user_project: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', google_cloud_storage_conn_id: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a new ACL entry on the specified object.

See also

For more information on how to use this operator, take a look at the guide: GCSObjectCreateAclEntryOperator

Parameters
  • bucket (str) -- Name of a bucket.

  • object_name (str) -- Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding

  • entity (str) -- The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers

  • role (str) -- The access permission for the entity. Acceptable values are: "OWNER", "READER".

  • generation (int) -- Optional. If present, selects a specific revision of this object.

  • user_project (str) -- (Optional) The project to be billed for this request. Required for Requester Pays buckets.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • google_cloud_storage_conn_id (str) -- (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['bucket', 'object_name', 'entity', 'generation', 'role', 'user_project', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator(*, source_bucket: str, source_object: str, transform_script: Union[str, List[str]], destination_bucket: Optional[str] = None, destination_object: Optional[str] = None, gcp_conn_id: str = 'google_cloud_default', impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Copies data from a source GCS location to a temporary location on the local filesystem. Runs a transformation on this file as specified by the transformation script and uploads the output to a destination bucket. If the output bucket is not specified the original file will be overwritten.

The locations of the source and destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from the source, transform it, and write the output to the local destination file.
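
A transformation script that follows this contract could be sketched as below. The upper-casing transform and the file name transform.py are placeholders chosen for illustration; only the argument order (source path first, destination path second) comes from the description above.

```python
import sys
from typing import Sequence


def transform(source_path: str, destination_path: str) -> None:
    """Read the local source file, transform it, write the local destination."""
    with open(source_path, "r", encoding="utf-8") as src:
        data = src.read()
    # Placeholder transformation: upper-case the whole file.
    with open(destination_path, "w", encoding="utf-8") as dst:
        dst.write(data.upper())


def main(argv: Sequence[str]) -> int:
    """argv[1] is the local source path, argv[2] the local destination path."""
    if len(argv) < 3:
        print("usage: transform.py SOURCE DESTINATION", file=sys.stderr)
        return 1
    transform(argv[1], argv[2])
    return 0


# In a real script file, finish with: sys.exit(main(sys.argv))
```

Saved as an executable file, such a script could be referenced as transform_script=['python', 'transform.py'].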

Parameters
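  • source_bucket (str) -- The bucket containing the source object. (templated)

  • destination_bucket (str) -- The bucket to upload the transformed file to. If not specified, the source bucket is used and the original file is overwritten. (templated)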

  • transform_script (Union[str, List[str]]) -- Location of the executable transformation script, or a list of arguments passed to subprocess, e.g. ['python', 'script.py', 10]. (templated)

  • gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['source_bucket', 'destination_bucket', 'transform_script', 'impersonation_chain'][source]
execute(self, context: dict)[source]
class airflow.providers.google.cloud.operators.gcs.GCSTimeSpanFileTransformOperator(*, source_bucket: str, source_prefix: str, source_gcp_conn_id: str, destination_bucket: str, destination_prefix: str, destination_gcp_conn_id: str, transform_script: Union[str, List[str]], source_impersonation_chain: Optional[Union[str, Sequence[str]]] = None, destination_impersonation_chain: Optional[Union[str, Sequence[str]]] = None, chunk_size: Optional[int] = None, download_continue_on_fail: Optional[bool] = False, download_num_attempts: int = 1, upload_continue_on_fail: Optional[bool] = False, upload_num_attempts: int = 1, **kwargs)[source]

Bases: airflow.models.BaseOperator

Determines a list of objects that were added or modified at a GCS source location during a specific time-span, copies them to a temporary location on the local file system, runs a transform on this file as specified by the transformation script and uploads the output to the destination bucket.

See also

For more information on how to use this operator, take a look at the guide: GCSTimeSpanFileTransformOperator

The locations of the source and destination files in the local filesystem are provided as the first and second arguments to the transformation script. The start and end of the time-span are passed to the transformation script as the third and fourth arguments, as UTC ISO 8601 strings.

The transformation script is expected to read the data from source, transform it and write the output to the local destination file.
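
Inside such a script, the time-span arguments might be parsed as follows. This sketch assumes the timestamps carry an explicit offset such as +00:00; a trailing Z is only accepted by datetime.fromisoformat from Python 3.11 onwards. The function name parse_timespan is hypothetical.

```python
from datetime import datetime, timezone
from typing import Sequence, Tuple


def parse_timespan(argv: Sequence[str]) -> Tuple[datetime, datetime]:
    """Return (start, end) parsed from the third and fourth script arguments."""
    start = datetime.fromisoformat(argv[3])
    end = datetime.fromisoformat(argv[4])
    # Treat timestamps without an offset as UTC, since the span is given in UTC.
    if start.tzinfo is None:
        start = start.replace(tzinfo=timezone.utc)
    if end.tzinfo is None:
        end = end.replace(tzinfo=timezone.utc)
    return start, end


# Hypothetical argument vector: script name, source path, destination path,
# then the start and end of the time-span.
example_argv = [
    "transform.py",
    "/tmp/in",
    "/tmp/out",
    "2021-03-04T00:00:00+00:00",
    "2021-03-05T00:00:00+00:00",
]
span_start, span_end = parse_timespan(example_argv)
```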

Parameters
  • source_bucket (str) -- The bucket to fetch data from. (templated)

  • source_prefix (str) -- Prefix string which filters objects whose names begin with this prefix. Can interpolate execution date and time components. (templated)

  • source_gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud to download files to be processed.

  • source_impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials (to download files to be processed), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • destination_bucket (str) -- The bucket to write data to. (templated)

  • destination_prefix (str) -- Prefix string for the upload location. Can interpolate execution date and time components. (templated)

  • destination_gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud to upload processed files.

  • destination_impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials (to upload processed files), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • transform_script (Union[str, List[str]]) -- Location of the executable transformation script, or a list of arguments passed to subprocess, e.g. ['python', 'script.py', 10]. (templated)

  • chunk_size (Optional[int]) -- The size of a chunk of data when downloading or uploading (in bytes). This must be a multiple of 256 KB (per the Google Cloud Storage API specification).

  • download_continue_on_fail -- With this set to true, if a download fails the task does not error out but will still continue.

  • upload_chunk_size -- The size of a chunk of data when uploading (in bytes). This must be a multiple of 256 KB (per the Google Cloud Storage API specification).

  • upload_continue_on_fail -- With this set to true, if an upload fails the task does not error out but will still continue.

  • upload_num_attempts (int) -- Number of attempts to try to upload a single file.
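
The 256 KB constraint on chunk_size can be checked before the operator is constructed. The helper below is a hypothetical sketch that takes 256 KB to mean 256 * 1024 bytes and rounds a requested size down to the nearest valid multiple.

```python
CHUNK_MULTIPLE = 256 * 1024  # 262144 bytes


def round_down_chunk_size(requested_bytes: int) -> int:
    """Return the largest multiple of 256 KB not exceeding requested_bytes."""
    if requested_bytes < CHUNK_MULTIPLE:
        raise ValueError("chunk size must be at least 256 KB")
    return (requested_bytes // CHUNK_MULTIPLE) * CHUNK_MULTIPLE


def is_valid_chunk_size(size_bytes: int) -> bool:
    """True when size_bytes is a positive multiple of 256 KB."""
    return size_bytes > 0 and size_bytes % CHUNK_MULTIPLE == 0
```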

template_fields = ['source_bucket', 'source_prefix', 'destination_bucket', 'destination_prefix', 'transform_script', 'source_impersonation_chain', 'destination_impersonation_chain'][source]
static interpolate_prefix(prefix: str, dt: datetime.datetime)[source]

Interpolate prefix with datetime.

Parameters
  • prefix (str) -- The prefix to interpolate

  • dt (datetime) -- The datetime to interpolate
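
The kind of interpolation this method performs can be sketched as follows. The sketch assumes strftime-style placeholders such as %Y and %m in the prefix; that is an assumption about the behaviour and not a copy of the provider's code, and the function name is hypothetical.

```python
from datetime import datetime
from typing import Optional


def interpolate_prefix_sketch(prefix: Optional[str], dt: datetime) -> Optional[str]:
    """Fill strftime-style placeholders in prefix with components of dt."""
    # An empty or missing prefix is passed through as None.
    return dt.strftime(prefix) if prefix else None


# Hypothetical prefix and date used only for illustration.
daily_prefix = interpolate_prefix_sketch("sales/%Y/%m/%d/", datetime(2021, 3, 4))
```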

execute(self, context: dict)[source]
class airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator(*, bucket_name: str, force: bool = True, gcp_conn_id: str = 'google_cloud_default', impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Deletes a bucket from Google Cloud Storage.

See also

For more information on how to use this operator, take a look at the guide: Deleting Bucket

Parameters
  • bucket_name (str) -- Name of the bucket which will be deleted.

  • force (bool) -- If False, a non-empty bucket cannot be deleted. If True, a non-empty bucket can be deleted. Defaults to True.

  • gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields = ['bucket_name', 'gcp_conn_id', 'impersonation_chain'][source]
execute(self, context)[source]
class airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator(*, source_bucket: str, destination_bucket: str, source_object: Optional[str] = None, destination_object: Optional[str] = None, recursive: bool = True, delete_extra_files: bool = False, allow_overwrite: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Synchronizes the contents of buckets, or of directories within buckets, in Google Cloud Storage.

Parameters source_object and destination_object describe the root sync directory. If they are not passed, the entire bucket will be synchronized. They should point to directories.

Note

The synchronization of individual files is not supported. Only entire directories can be synchronized.

See also

For more information on how to use this operator, take a look at the guide: GCSSynchronizeBuckets

Parameters
  • source_bucket (str) -- The name of the bucket containing the source objects.

  • destination_bucket (str) -- The name of the bucket containing the destination objects.

  • source_object (Optional[str]) -- The root sync directory in the source bucket.

  • destination_object (Optional[str]) -- The root sync directory in the destination bucket.

  • recursive (bool) -- If True, subdirectories will be considered.

  • allow_overwrite (bool) -- If True, files will be overwritten when a mismatched file is found. By default, overwriting files is not allowed.

  • delete_extra_files (bool) --

    If True, deletes extra files from the destination that are not found in the source. By default, extra files are not deleted.

    Note

    This option can delete data quickly if you specify the wrong source/destination combination.

  • gcp_conn_id (str) -- (Optional) The connection ID used to connect to Google Cloud.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain (Union[str, Sequence[str]]) -- Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
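
How allow_overwrite and delete_extra_files shape a synchronization can be modelled with plain dictionaries. This is a local sketch, not the provider's implementation: each dictionary maps an object name to a checksum, a mismatch means the same name with a different checksum, and extra files are assumed to be objects present only in the destination. The function name plan_sync is hypothetical.

```python
from typing import Dict, List, Tuple


def plan_sync(
    source: Dict[str, str],
    destination: Dict[str, str],
    allow_overwrite: bool = False,
    delete_extra_files: bool = False,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (to_copy, to_overwrite, to_delete) lists of object names."""
    # Objects missing from the destination are always copied.
    to_copy = sorted(name for name in source if name not in destination)
    # Objects present on both sides with different checksums are mismatched.
    mismatched = sorted(
        name
        for name in source
        if name in destination and source[name] != destination[name]
    )
    to_overwrite = mismatched if allow_overwrite else []
    # Assumption: extra files are those found only in the destination.
    extra = sorted(name for name in destination if name not in source)
    to_delete = extra if delete_extra_files else []
    return to_copy, to_overwrite, to_delete


# Hypothetical bucket contents used only for illustration.
src = {"a.csv": "111", "b.csv": "222"}
dst = {"b.csv": "999", "old.csv": "333"}
default_plan = plan_sync(src, dst)
full_plan = plan_sync(src, dst, allow_overwrite=True, delete_extra_files=True)
```

With the defaults only a.csv is copied; with both flags set, b.csv is also overwritten and old.csv is removed, which is why a wrong source/destination combination can delete data quickly.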

template_fields = ['source_bucket', 'destination_bucket', 'source_object', 'destination_object', 'recursive', 'delete_extra_files', 'allow_overwrite', 'gcp_conn_id', 'delegate_to', 'impersonation_chain'][source]
execute(self, context)[source]
