airflow.providers.google.cloud.operators.gcs
This module contains a Google Cloud Storage Bucket operator.
Module Contents
Classes
GCSCreateBucketOperator – Creates a new bucket.
GCSListObjectsOperator – List all objects from the bucket filtered by given string prefix and delimiter in name or match_glob.
GCSDeleteObjectsOperator – Deletes objects from a list or all objects matching a prefix from a Google Cloud Storage bucket.
GCSBucketCreateAclEntryOperator – Creates a new ACL entry on the specified bucket.
GCSObjectCreateAclEntryOperator – Creates a new ACL entry on the specified object.
GCSFileTransformOperator – Copies data from a source GCS location to a temporary location on the local filesystem.
GCSTimeSpanFileTransformOperator – Copy objects that were modified during a time span, run a transform, and upload results to a bucket.
GCSDeleteBucketOperator – Deletes a bucket from Google Cloud Storage.
GCSSynchronizeBucketsOperator – Synchronizes the contents of the buckets or bucket's directories in the Google Cloud Services.
- class airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator(*, bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Creates a new bucket.
Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
- Parameters
bucket_name (str) – The name of the bucket. (templated)
resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert
storage_class (str) – This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated). Values include MULTI_REGIONAL, REGIONAL, STANDARD, NEARLINE, COLDLINE. If this value is not specified when the bucket is created, it will default to STANDARD.
location (str) – The location of the bucket. (templated) Object data for objects in the bucket resides in physical storage within this region. Defaults to US.
project_id (str | None) – The ID of the Google Cloud Project. (templated)
labels (dict | None) – User-provided labels, in key/value pairs.
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
The following operator would create a new bucket test-bucket with the MULTI_REGIONAL storage class in the EU region:

    from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator

    # Create "test-bucket" in the EU location with two labels attached.
    CreateBucket = GCSCreateBucketOperator(
        task_id="CreateNewBucket",
        bucket_name="test-bucket",
        storage_class="MULTI_REGIONAL",
        location="EU",
        labels={"env": "dev", "team": "airflow"},
        gcp_conn_id="airflow-conn-id",
    )
- class airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator(*, bucket, prefix=None, delimiter=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, match_glob=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
List all objects from the bucket filtered by given string prefix and delimiter in name or match_glob.
This operator returns a Python list of object names, which downstream tasks can read through XCom.
- Parameters
bucket (str) – The Google Cloud Storage bucket in which to find the objects. (templated)
prefix (str | list[str] | None) – String or list of strings, which filter objects whose name begins with it/them. (templated)
delimiter (str | None) – (Deprecated) The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
match_glob (str | None) – (Optional) Filters objects based on the glob pattern given by the string (e.g., '**/*.json').
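- Example:
The following operator would list all the Avro files from the sales/sales-2017 folder in the data bucket:

    from airflow.providers.google.cloud.operators.gcs import GCSListObjectsOperator

    # google_cloud_conn_id is assumed to be defined elsewhere in the DAG file.
    GCS_Files = GCSListObjectsOperator(
        task_id="GCS_Files",
        bucket="data",
        prefix="sales/sales-2017/",
        match_glob="**/*.avro",
        gcp_conn_id=google_cloud_conn_id,
    )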
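The prefix parameter accepts either one string or a list of strings. As a rough local illustration of that rule, the sketch below applies it to a plain Python list of names. The helper filter_by_prefix and the sample names are hypothetical and not part of the provider; the operator itself asks Google Cloud Storage to do the listing.

```python
from __future__ import annotations


def filter_by_prefix(names: list[str], prefix: str | list[str] | None) -> list[str]:
    """Keep the names that start with the prefix, or with any prefix in a list."""
    if prefix is None:
        # No prefix means no filtering at all.
        return list(names)
    prefixes = (prefix,) if isinstance(prefix, str) else tuple(prefix)
    # str.startswith accepts a tuple and returns True if any entry matches.
    return [name for name in names if name.startswith(prefixes)]


# Hypothetical object names, used only for this illustration.
object_names = [
    "sales/sales-2017/january.avro",
    "sales/sales-2017/february.avro",
    "sales/sales-2018/january.avro",
    "inventory/2017/stock.csv",
]

selected = filter_by_prefix(object_names, ["sales/sales-2017/", "inventory/"])
print(selected)
```

The sketch covers prefix matching only; delimiter and match_glob are evaluated by the storage service and are not modelled here.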
- class airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator(*, bucket_name, objects=None, prefix=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Deletes objects from a list or all objects matching a prefix from a Google Cloud Storage bucket.
- Parameters
bucket_name (str) – The GCS bucket to delete from
objects (list[str] | None) – List of objects to delete. These should be the names of objects in the bucket, not including gs://bucket/
prefix (str | None) – String or list of strings, which filter objects whose name begins with it/them. (templated)
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
- class airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator(*, bucket, entity, role, user_project=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Creates a new ACL entry on the specified bucket.
See also
For more information on how to use this operator, take a look at the guide: GCSBucketCreateAclEntryOperator
- Parameters
bucket (str) – Name of a bucket.
entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers
role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”.
user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
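The entity forms and role values listed above are plain strings, so they are easy to mistype. The following is a minimal sketch of building and checking them before passing them to the operator. The helpers build_entity and check_bucket_role are hypothetical names introduced for illustration, and the project-team-projectId form is left out of the sketch.

```python
# Role values accepted for a bucket ACL entry, as listed in the parameters above.
BUCKET_ACL_ROLES = {"OWNER", "READER", "WRITER"}


def build_entity(kind: str, identifier: str = "") -> str:
    """Build an entity string such as 'user-jane@example.com' or 'allUsers'."""
    if kind in ("allUsers", "allAuthenticatedUsers"):
        # These two forms stand alone and take no identifier.
        return kind
    if kind in ("user", "group", "domain"):
        if not identifier:
            raise ValueError(f"entity kind {kind!r} needs an identifier")
        return f"{kind}-{identifier}"
    raise ValueError(f"unsupported entity kind: {kind!r}")


def check_bucket_role(role: str) -> str:
    """Return the role unchanged if it is one of the accepted bucket roles."""
    if role not in BUCKET_ACL_ROLES:
        raise ValueError(f"role must be one of {sorted(BUCKET_ACL_ROLES)}, got {role!r}")
    return role


entity = build_entity("user", "jane@example.com")
role = check_bucket_role("READER")
print(entity, role)
```

The resulting strings would then be passed as the entity and role arguments of the operator.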
- class airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator(*, bucket, object_name, entity, role, generation=None, user_project=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Creates a new ACL entry on the specified object.
See also
For more information on how to use this operator, take a look at the guide: GCSObjectCreateAclEntryOperator
- Parameters
bucket (str) – Name of a bucket.
object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding
entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers
role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”.
generation (int | None) – Optional. If present, selects a specific revision of this object.
user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
- class airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator(*, source_bucket, source_object, transform_script, destination_bucket=None, destination_object=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Copies data from a source GCS location to a temporary location on the local filesystem.
Runs a transformation on this file as specified by the transformation script and uploads the output to a destination bucket. If the output bucket is not specified the original file will be overwritten.
The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from the source, transform it, and write the output to the local destination file.
- Parameters
source_bucket (str) – The bucket to locate the source_object. (templated)
source_object (str) – The key to be retrieved from GCS. (templated)
destination_bucket (str | None) – The bucket to upload the key after transformation. If not provided, source_bucket will be used. (templated)
destination_object (str | None) – The key to be written in GCS. If not provided, source_object will be used. (templated)
transform_script (str | list[str]) – Location of the executable transformation script, or a list of arguments passed to subprocess, e.g. ['python', 'script.py', 10]. (templated)
gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
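A transformation script for this operator only needs to honour the contract described above: read the local source file named by the first argument and write the result to the local destination file named by the second. The following is a minimal sketch of such a script. The upper-casing step and the file name transform.py are placeholders chosen for illustration, not anything the provider prescribes.

```python
from __future__ import annotations

import sys
from pathlib import Path


def transform(source_path: str, destination_path: str) -> None:
    """Read the downloaded file, upper-case its text, and write the result."""
    text = Path(source_path).read_text(encoding="utf-8")
    Path(destination_path).write_text(text.upper(), encoding="utf-8")


def main(argv: list[str]) -> int:
    # argv[1] is the local copy of the source object,
    # argv[2] is the local file whose contents will be uploaded.
    if len(argv) < 3:
        print("usage: transform.py SOURCE DESTINATION", file=sys.stderr)
        return 2
    transform(argv[1], argv[2])
    return 0


# Only act when arguments were supplied, so importing the file is harmless.
if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main(sys.argv))
```

With the script saved as an executable file, transform_script could be given either its path or a list such as ['python', 'transform.py'].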
- class airflow.providers.google.cloud.operators.gcs.GCSTimeSpanFileTransformOperator(*, source_bucket, source_prefix, source_gcp_conn_id, destination_bucket, destination_prefix, destination_gcp_conn_id, transform_script, source_impersonation_chain=None, destination_impersonation_chain=None, chunk_size=None, download_continue_on_fail=False, download_num_attempts=1, upload_continue_on_fail=False, upload_num_attempts=1, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Copy objects that were modified during a time span, run a transform, and upload results to a bucket.
Determines a list of objects that were added or modified at a GCS source location during a specific time-span, copies them to a temporary location on the local file system, runs a transform on this file as specified by the transformation script and uploads the output to the destination bucket.
See also
For more information on how to use this operator, take a look at the guide: GCSTimeSpanFileTransformOperator
The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The time span is passed to the transformation script as the third and fourth arguments, each as a UTC ISO 8601 string.
The transformation script is expected to read the data from source, transform it and write the output to the local destination file.
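The four-argument contract described above can be sketched as a small script. This is an illustration under stated assumptions, not the provider's own code: the header line it writes is a placeholder transformation, the exact ISO 8601 rendering of the timestamps is assumed (a trailing "Z" is tolerated as UTC), and the script also accepts directories for the first two arguments as a precaution.

```python
from __future__ import annotations

import sys
from datetime import datetime
from pathlib import Path


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def transform_file(src: Path, dst: Path, start: datetime, end: datetime) -> None:
    """Copy the text and prepend a header naming the processed time span."""
    header = f"# time span: {start.isoformat()} .. {end.isoformat()}\n"
    dst.write_text(header + src.read_text(encoding="utf-8"), encoding="utf-8")


def main(argv: list[str]) -> int:
    # argv[1], argv[2]: local source and destination; argv[3], argv[4]: time span.
    if len(argv) != 5:
        print("usage: script.py SOURCE DESTINATION START END", file=sys.stderr)
        return 2
    source, destination = Path(argv[1]), Path(argv[2])
    start, end = parse_utc(argv[3]), parse_utc(argv[4])
    if source.is_dir():
        # Precaution: process every file when directories are passed.
        destination.mkdir(parents=True, exist_ok=True)
        for src in sorted(p for p in source.iterdir() if p.is_file()):
            transform_file(src, destination / src.name, start, end)
    else:
        transform_file(source, destination, start, end)
    return 0


# Only act when arguments were supplied, so importing the file is harmless.
if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main(sys.argv))
```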
- Parameters
source_bucket (str) – The bucket to fetch data from. (templated)
source_prefix (str) – Prefix string which filters objects whose names begin with this prefix. Can interpolate execution date and time components. (templated)
source_gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud to download files to be processed.
source_impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials (to download files to be processed), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
destination_bucket (str) – The bucket to write data to. (templated)
destination_prefix (str) – Prefix string for the upload location. Can interpolate execution date and time components. (templated)
destination_gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud to upload processed files.
destination_impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials (to upload processed files), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
transform_script (str | list[str]) – Location of the executable transformation script, or a list of arguments passed to subprocess, e.g. ['python', 'script.py', 10]. (templated)
chunk_size (int | None) – The size of a chunk of data when downloading or uploading (in bytes). This must be a multiple of 256 KB (per the Google Cloud Storage API specification).
download_continue_on_fail (bool | None) – If True, a failed download does not fail the task; processing continues.
upload_chunk_size – The size of a chunk of data when uploading (in bytes). This must be a multiple of 256 KB (per the Google Cloud Storage API specification).
upload_continue_on_fail (bool | None) – If True, a failed upload does not fail the task; processing continues.
upload_num_attempts (int) – Number of attempts to try to upload a single file.
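Because chunk_size must be a multiple of 256 KB, it can help to check the value before handing it to the operator. The sketch below is one way to do that; validate_chunk_size is a hypothetical helper, and 256 KB is taken here to mean 256 * 1024 bytes.

```python
from __future__ import annotations

# Assumption: 256 KB means 256 * 1024 = 262144 bytes.
CHUNK_UNIT = 256 * 1024


def validate_chunk_size(chunk_size: int | None) -> int | None:
    """Return chunk_size unchanged if it is None or a positive multiple of 256 KB."""
    if chunk_size is None:
        # None mirrors the operator's default of not setting a chunk size.
        return None
    if chunk_size <= 0 or chunk_size % CHUNK_UNIT != 0:
        raise ValueError(
            f"chunk_size must be a positive multiple of {CHUNK_UNIT} bytes, got {chunk_size}"
        )
    return chunk_size


# Four units of 256 KB, i.e. 1 MiB.
one_mebibyte = validate_chunk_size(4 * CHUNK_UNIT)
print(one_mebibyte)
```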
- template_fields: Sequence[str] = ('source_bucket', 'source_prefix', 'destination_bucket', 'destination_prefix',...
- static interpolate_prefix(prefix, dt)
Interpolate prefix with datetime.
- Parameters
prefix (str) – The prefix to interpolate
dt (datetime.datetime) – The datetime to interpolate
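The text does not spell out the placeholder syntax used for interpolation. Assuming strftime-style directives such as %Y and %m, the behaviour can be sketched as below; interpolate_prefix_sketch is a hypothetical stand-in written for illustration, not the provider's method.

```python
from __future__ import annotations

from datetime import datetime, timezone


def interpolate_prefix_sketch(prefix: str, dt: datetime) -> str | None:
    """Fill strftime-style directives in prefix with components of dt.

    Assumption: placeholders follow strftime syntax. An empty prefix
    yields None in this sketch.
    """
    return dt.strftime(prefix) if prefix else None


run_time = datetime(2017, 3, 5, 14, 30, tzinfo=timezone.utc)
daily_prefix = interpolate_prefix_sketch("sales/%Y/%m/%d/", run_time)
print(daily_prefix)
```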
- class airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator(*, bucket_name, force=True, gcp_conn_id='google_cloud_default', impersonation_chain=None, user_project=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Deletes a bucket from Google Cloud Storage.
See also
For more information on how to use this operator, take a look at the guide: Deleting Bucket
- Parameters
bucket_name (str) – Name of the bucket to delete.
force (bool) – If False, a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket.
gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
user_project (str | None) – (Optional) The identifier of the project to bill for this request. Required for Requester Pays buckets.
- class airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator(*, source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, delete_extra_files=False, allow_overwrite=False, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
Synchronizes the contents of the buckets or bucket’s directories in the Google Cloud Services.
Parameters source_object and destination_object describe the root sync directory. If they are not passed, the entire bucket will be synchronized. They should point to directories.
Note
The synchronization of individual files is not supported. Only entire directories can be synchronized.
See also
For more information on how to use this operator, take a look at the guide: GCSSynchronizeBuckets
- Parameters
source_bucket (str) – The name of the bucket containing the source objects.
destination_bucket (str) – The name of the bucket containing the destination objects.
source_object (str | None) – The root sync directory in the source bucket.
destination_object (str | None) – The root sync directory in the destination bucket.
recursive (bool) – If True, subdirectories will be considered.
allow_overwrite (bool) – If True, the files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed.
delete_extra_files (bool) – If True, deletes files in the destination that are not found in the source. By default, extra files are not deleted.
Note
This option can delete data quickly if you specify the wrong source/destination combination.
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
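The interaction of allow_overwrite and delete_extra_files can be pictured as a sync plan computed from two listings. The sketch below is a conceptual model, not the provider's implementation: plan_sync and SyncPlan are hypothetical names, each listing is a dict mapping object name to checksum, "extra files" is assumed to mean destination objects that are absent from the source, and the recursive option is not modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    to_copy: list[str] = field(default_factory=list)       # missing in destination
    to_overwrite: list[str] = field(default_factory=list)  # mismatched, overwrite allowed
    skipped: list[str] = field(default_factory=list)       # mismatched, overwrite not allowed
    to_delete: list[str] = field(default_factory=list)     # extra files in destination


def plan_sync(
    source: dict[str, str],
    destination: dict[str, str],
    *,
    allow_overwrite: bool = False,
    delete_extra_files: bool = False,
) -> SyncPlan:
    """Decide what a sync would do, given name -> checksum listings."""
    plan = SyncPlan()
    for name in sorted(source):
        if name not in destination:
            plan.to_copy.append(name)
        elif source[name] != destination[name]:
            target = plan.to_overwrite if allow_overwrite else plan.skipped
            target.append(name)
    if delete_extra_files:
        # Assumption: extras are objects present only in the destination.
        plan.to_delete = sorted(set(destination) - set(source))
    return plan


# Hypothetical listings used only for this illustration.
source = {"a.csv": "1", "b.csv": "2", "c.csv": "3"}
destination = {"b.csv": "2", "c.csv": "9", "old.csv": "0"}

print(plan_sync(source, destination))
print(plan_sync(source, destination, allow_overwrite=True, delete_extra_files=True))
```

Computing the plan first, before anything is copied or deleted, is a convenient way to review what a given source and destination combination would do, which matters given the warning about delete_extra_files.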