airflow.providers.google.cloud.operators.gcs
This module contains Google Cloud Storage operators.
Module Contents
Classes
| GCSCreateBucketOperator | Creates a new bucket. |
| GCSListObjectsOperator | Lists all objects from the bucket, filtered by the given string prefix and delimiter in name, or by match_glob. |
| GCSDeleteObjectsOperator | Deletes objects from a list or all objects matching a prefix from a Google Cloud Storage bucket. |
| GCSBucketCreateAclEntryOperator | Creates a new ACL entry on the specified bucket. |
| GCSObjectCreateAclEntryOperator | Creates a new ACL entry on the specified object. |
| GCSFileTransformOperator | Copies data from a source GCS location to a temporary location on the local filesystem. |
| GCSTimeSpanFileTransformOperator | Copies objects that were modified during a time span, runs a transform, and uploads the results to a bucket. |
| GCSDeleteBucketOperator | Deletes a bucket from Google Cloud Storage. |
| GCSSynchronizeBucketsOperator | Synchronizes the contents of the buckets or bucket's directories in the Google Cloud Services. |
- class airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator(*, bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Creates a new bucket.
- Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
- See also: For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
- Parameters
- bucket_name (str) – The name of the bucket. (templated) 
- resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert 
- storage_class (str) – Defines how objects in the bucket are stored and determines the SLA and the cost of storage. (templated) Values include MULTI_REGIONAL, REGIONAL, STANDARD, NEARLINE, and COLDLINE. If this value is not specified when the bucket is created, it will default to STANDARD (the operator's own default argument, shown in the signature above, is MULTI_REGIONAL). 
- location (str) – The location of the bucket. (templated) Object data for objects in the bucket resides in physical storage within this region. Defaults to US. 
- project_id (str | None) – The ID of the Google Cloud Project. (templated) 
- labels (dict | None) – User-provided labels, in key/value pairs. 
- gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
 
 - The following operator would create a new bucket test-bucket with the MULTI_REGIONAL storage class in the EU region:

       from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator

       CreateBucket = GCSCreateBucketOperator(
           task_id="CreateNewBucket",
           bucket_name="test-bucket",
           storage_class="MULTI_REGIONAL",
           location="EU",
           labels={"env": "dev", "team": "airflow"},
           gcp_conn_id="airflow-conn-id",
       )
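Because bucket names share one flat namespace, it can help to reject obviously malformed names before the task runs. The following is a minimal sketch of such a pre-check and is not part of the operator: `looks_like_valid_bucket_name` is a hypothetical helper, and it covers only a simplified subset of the naming guidelines linked above (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). It cannot tell whether a name is already in use.

```python
import re

# Simplified pattern (an assumption, not the full guideline): 3-63 characters,
# lowercase letters, digits, '-', '_' and '.', starting and ending with a
# letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if ``name`` passes the simplified checks above."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


for candidate in ("test-bucket", "Test-Bucket", "ab", "-leading-dash"):
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A check like this only catches formatting mistakes early; the service remains the authority on whether a name is acceptable and available.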
- class airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator(*, bucket, prefix=None, delimiter=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, match_glob=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Lists all objects from the bucket, filtered by the given string prefix and delimiter in name, or by match_glob.
- This operator returns a Python list of object names, which downstream tasks can read through XCom.
- Parameters
- bucket (str) – The Google Cloud Storage bucket to find the objects. (templated) 
- prefix (str | list[str] | None) – A string or list of strings; only objects whose names begin with one of them are listed. (templated) 
- delimiter (str | None) – (Deprecated) The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files in a directory in GCS you would use delimiter=’.csv’. 
- gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
- match_glob (str | None) – (Optional) Filters objects based on the glob pattern given by the string (e.g., '**/*/.json'). 
 
 - Example:
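- The following operator would list all the Avro files from the sales/sales-2017 folder in the data bucket.

      from airflow.providers.google.cloud.operators.gcs import GCSListObjectsOperator

      # google_cloud_conn_id is assumed to be defined elsewhere in the DAG file.
      GCS_Files = GCSListObjectsOperator(
          task_id='GCS_Files',
          bucket='data',
          prefix='sales/sales-2017/',
          match_glob='**/*/.avro',
          gcp_conn_id=google_cloud_conn_id,
      )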
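To make the `prefix` semantics concrete, here is a minimal pure-Python sketch of the filtering described above: a single string or a list of strings, with an object kept when its name begins with any of them. `filter_by_prefix` and the object names are hypothetical and for illustration only; the sketch does not call Cloud Storage and does not model `match_glob` or `delimiter`.

```python
from typing import Iterable, List, Optional, Sequence, Union


def filter_by_prefix(
    names: Iterable[str],
    prefix: Optional[Union[str, Sequence[str]]],
) -> List[str]:
    """Keep the names that begin with the prefix (or any of the prefixes)."""
    if prefix is None:
        return list(names)
    prefixes = (prefix,) if isinstance(prefix, str) else tuple(prefix)
    # str.startswith accepts a tuple and returns True if any entry matches.
    return [name for name in names if name.startswith(prefixes)]


objects = [
    "sales/sales-2017/january.avro",
    "sales/sales-2017/february.avro",
    "sales/sales-2018/january.avro",
    "logs/2017/app.log",
]

print(filter_by_prefix(objects, "sales/sales-2017/"))
print(filter_by_prefix(objects, ["sales/sales-2018/", "logs/"]))
```

The result is a plain list of strings, which matches the shape the operator is described as returning for downstream tasks.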
 
- class airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator(*, bucket_name, objects=None, prefix=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Deletes objects from a list or all objects matching a prefix from a Google Cloud Storage bucket.
- Parameters
- bucket_name (str) – The GCS bucket to delete from 
- objects (list[str] | None) – List of objects to delete. These should be the names of objects in the bucket, not including gs://bucket/ 
- prefix (str | None) – String or list of strings; objects whose names begin with it/them are deleted. (templated) 
- gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
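Since `objects` expects bare object names rather than full `gs://bucket/...` URIs, a small conversion step may be needed when upstream tasks produce URIs. A minimal sketch, assuming a hypothetical helper `split_gcs_uri` written with plain string operations (it is not part of the provider):

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split ``gs://bucket/path/to/object`` into (bucket, object name)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the object name.
    bucket, _, object_name = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, object_name


uris = ["gs://data/sales/2017/a.csv", "gs://data/sales/2017/b.csv"]
object_names = [split_gcs_uri(uri)[1] for uri in uris]
print(object_names)
```

The resulting names are in the form the `objects` parameter describes, with the bucket passed separately as `bucket_name`.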
 
 
- class airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator(*, bucket, entity, role, user_project=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Creates a new ACL entry on the specified bucket.
- See also: For more information on how to use this operator, take a look at the guide: GCSBucketCreateAclEntryOperator
- Parameters
- bucket (str) – Name of a bucket. 
- entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers 
- role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”. 
- user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets. 
- gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
 
 
- class airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator(*, bucket, object_name, entity, role, generation=None, user_project=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Creates a new ACL entry on the specified object.
- See also: For more information on how to use this operator, take a look at the guide: GCSObjectCreateAclEntryOperator
- Parameters
- bucket (str) – Name of a bucket. 
- object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding 
- entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers 
- role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”. 
- generation (int | None) – Optional. If present, selects a specific revision of this object. 
- user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets. 
- gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
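The entity forms and role values listed for the two ACL operators above can be captured in a small helper that fails early on typos. This is a sketch for illustration only: `acl_entity`, `check_role`, and the constant names are hypothetical, the role sets are copied from the parameter descriptions above, and the `project-team-projectId` form is deliberately left out to keep the example short.

```python
BUCKET_ROLES = frozenset({"OWNER", "READER", "WRITER"})
OBJECT_ROLES = frozenset({"OWNER", "READER"})

# Entity kinds that are followed by an identifier, e.g. "user-<email>".
PREFIXED_KINDS = frozenset({"user", "group", "domain"})
# Entity values that stand alone.
STANDALONE_ENTITIES = frozenset({"allUsers", "allAuthenticatedUsers"})


def acl_entity(kind: str, identifier: str = "") -> str:
    """Build an entity string such as ``user-alice@example.com``."""
    if kind in STANDALONE_ENTITIES:
        if identifier:
            raise ValueError(f"{kind} does not take an identifier")
        return kind
    if kind in PREFIXED_KINDS and identifier:
        return f"{kind}-{identifier}"
    raise ValueError(f"Unsupported entity: kind={kind!r}, identifier={identifier!r}")


def check_role(role: str, *, for_object: bool) -> str:
    """Return ``role`` if it is listed for the bucket or object operator."""
    allowed = OBJECT_ROLES if for_object else BUCKET_ROLES
    if role not in allowed:
        raise ValueError(f"Role {role!r} is not one of {sorted(allowed)}")
    return role


print(acl_entity("user", "alice@example.com"), check_role("READER", for_object=True))
print(acl_entity("allUsers"), check_role("WRITER", for_object=False))
```

The returned strings would then be passed as the `entity` and `role` arguments of the corresponding operator.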
 
 
- class airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator(*, source_bucket, source_object, transform_script, destination_bucket=None, destination_object=None, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Copies data from a source GCS location to a temporary location on the local filesystem.
- Runs a transformation on this file as specified by the transformation script and uploads the output to a destination bucket. If the output bucket is not specified, the original file will be overwritten.
- The locations of the source and destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from the source, transform it, and write the output to the local destination file.
- Parameters
- source_bucket (str) – The bucket to locate the source_object. (templated) 
- source_object (str) – The key to be retrieved from GCS. (templated) 
- destination_bucket (str | None) – The bucket to upload the key after transformation. If not provided, source_bucket will be used. (templated) 
- destination_object (str | None) – The key to be written in GCS. If not provided, source_object will be used. (templated) 
- transform_script (str | list[str]) – Location of the executable transformation script, or a list of arguments passed to subprocess, e.g. [‘python’, ‘script.py’, 10]. (templated) 
- gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
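The argument contract described above (source path first, destination path second) can be sketched as a small script. This is an illustrative example, not code shipped with the provider: the upper-casing transform, the function names, and the file names are all hypothetical, and the temporary directory at the end merely stands in for the operator's download and upload steps.

```python
import sys
import tempfile
from pathlib import Path
from typing import List


def transform(source_path: Path, destination_path: Path) -> None:
    """Read the source file, upper-case its text, write the destination file."""
    text = source_path.read_text(encoding="utf-8")
    destination_path.write_text(text.upper(), encoding="utf-8")


def main(argv: List[str]) -> int:
    """argv[1] is the local source path, argv[2] the local destination path."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} SOURCE DESTINATION", file=sys.stderr)
        return 2
    transform(Path(argv[1]), Path(argv[2]))
    return 0


# Local demonstration: call main() with the same argument layout,
# using temporary files in place of the downloaded and uploaded objects.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.txt"
    dst = Path(tmp) / "destination.txt"
    src.write_text("hello gcs\n", encoding="utf-8")
    exit_code = main(["transform.py", str(src), str(dst)])
    result = dst.read_text(encoding="utf-8")

print(exit_code, repr(result))
```

Saved as a standalone script, the file would end with `sys.exit(main(sys.argv))` instead of the demonstration block, and its location would be passed as `transform_script`.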
 
 
- class airflow.providers.google.cloud.operators.gcs.GCSTimeSpanFileTransformOperator(*, source_bucket, source_prefix, source_gcp_conn_id, destination_bucket, destination_prefix, destination_gcp_conn_id, transform_script, source_impersonation_chain=None, destination_impersonation_chain=None, chunk_size=None, download_continue_on_fail=False, download_num_attempts=1, upload_continue_on_fail=False, upload_num_attempts=1, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Copies objects that were modified during a time span, runs a transform, and uploads the results to a bucket.
- Determines a list of objects that were added or modified at a GCS source location during a specific time span, copies them to a temporary location on the local file system, runs a transform on each file as specified by the transformation script, and uploads the output to the destination bucket.
- See also: For more information on how to use this operator, take a look at the guide: GCSTimeSpanFileTransformOperator
- The locations of the source and destination files in the local filesystem are provided as the first and second arguments to the transformation script. The time span is passed to the transformation script as the third and fourth arguments, as UTC ISO 8601 strings.
- The transformation script is expected to read the data from the source, transform it, and write the output to the local destination file.
- Parameters
- source_bucket (str) – The bucket to fetch data from. (templated) 
- source_prefix (str) – Prefix string that filters objects whose names begin with this prefix. Can interpolate execution date and time components. (templated) 
- source_gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud to download files to be processed. 
- source_impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials (to download files to be processed), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
- destination_bucket (str) – The bucket to write data to. (templated) 
- destination_prefix (str) – Prefix string for the upload location. Can interpolate execution date and time components. (templated) 
- destination_gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud to upload processed files. 
- destination_impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials (to upload processed files), or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
- transform_script (str | list[str]) – Location of the executable transformation script, or a list of arguments passed to subprocess, e.g. [‘python’, ‘script.py’, 10]. (templated) 
- chunk_size (int | None) – The size of a chunk of data when downloading or uploading (in bytes). This must be a multiple of 256 KB (per the Google Cloud Storage API specification). 
- download_continue_on_fail (bool | None) – If set to True, a failed download does not fail the task, which continues running. 
- download_num_attempts (int) – Number of attempts to download a single file. 
- upload_chunk_size – The size of a chunk of data when uploading (in bytes). This must be a multiple of 256 KB (per the Google Cloud Storage API specification). 
- upload_continue_on_fail (bool | None) – If set to True, a failed upload does not fail the task, which continues running. 
- upload_num_attempts (int) – Number of attempts to upload a single file. 
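A transformation script for this operator receives four arguments: the local source path, the local destination path, and the start and end of the time span as UTC ISO 8601 strings. The sketch below shows one way to consume them. It is illustrative only: the header-writing transform and all names are hypothetical, and the exact timestamp layout (here `2021-01-01T00:00:00+00:00`) is an assumption, so a trailing `Z` is also accepted.

```python
import sys
import tempfile
from datetime import datetime
from pathlib import Path
from typing import List


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as +00:00."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def main(argv: List[str]) -> int:
    """argv[1:5] = source path, destination path, span start, span end."""
    if len(argv) != 5:
        print(f"usage: {argv[0]} SOURCE DESTINATION START END", file=sys.stderr)
        return 2
    source, destination = Path(argv[1]), Path(argv[2])
    start, end = parse_utc(argv[3]), parse_utc(argv[4])
    # Hypothetical transform: prepend the window length, then copy the data.
    header = f"# window_seconds={int((end - start).total_seconds())}\n"
    destination.write_text(
        header + source.read_text(encoding="utf-8"), encoding="utf-8"
    )
    return 0


# Local demonstration with temporary files and a one-day time span.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"
    dst = Path(tmp) / "out.txt"
    src.write_text("row-1\n", encoding="utf-8")
    exit_code = main(
        [
            "transform.py",
            str(src),
            str(dst),
            "2021-01-01T00:00:00+00:00",
            "2021-01-02T00:00:00+00:00",
        ]
    )
    result = dst.read_text(encoding="utf-8")

print(exit_code, repr(result))
```

As a standalone script, the demonstration block would be replaced by `sys.exit(main(sys.argv))`.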
 
 - template_fields: Sequence[str] = ('source_bucket', 'source_prefix', 'destination_bucket', 'destination_prefix',...
 - static interpolate_prefix(prefix, dt)
- Interpolate prefix with datetime.
- Parameters
- prefix (str) – The prefix to interpolate 
- dt (datetime.datetime) – The datetime to interpolate 
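One plausible reading of "interpolate prefix with datetime" is strftime-style substitution, where codes such as `%Y` and `%m` in the prefix are replaced by components of `dt`. The sketch below assumes that behaviour; `interpolate_prefix_sketch` is a hypothetical stand-in, and the actual method may differ in details.

```python
from datetime import datetime, timezone
from typing import Optional


def interpolate_prefix_sketch(prefix: str, dt: datetime) -> Optional[str]:
    """Expand strftime codes in ``prefix`` using ``dt``; an empty prefix gives None."""
    return dt.strftime(prefix) if prefix else None


run_time = datetime(2021, 3, 14, 15, 9, tzinfo=timezone.utc)
print(interpolate_prefix_sketch("sales/%Y/%m/%d/", run_time))  # sales/2021/03/14/
```

Under this assumption, a `source_prefix` such as `sales/%Y/%m/%d/` would select a different folder for each run date.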
 
 
 
- class airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator(*, bucket_name, force=True, gcp_conn_id='google_cloud_default', impersonation_chain=None, user_project=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Deletes a bucket from Google Cloud Storage.
- See also: For more information on how to use this operator, take a look at the guide: Deleting Bucket
- Parameters
- bucket_name (str) – Name of the bucket that will be deleted. 
- force (bool) – If False, a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket. 
- gcp_conn_id (str) – The connection ID to use when connecting to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
- user_project (str | None) – (Optional) The identifier of the project to bill for this request. Required for Requester Pays buckets. 
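The `impersonation_chain` description, repeated for every operator in this module, distinguishes a single account from a chain in which the last account is the one impersonated and the preceding ones are intermediate identities. A minimal sketch of that split, using hypothetical names (`split_impersonation_chain` and the example service-account emails) and not reflecting the provider's internal code:

```python
from typing import List, Optional, Sequence, Tuple, Union


def split_impersonation_chain(
    chain: Optional[Union[str, Sequence[str]]],
) -> Tuple[Optional[str], List[str]]:
    """Return (account to impersonate, intermediate accounts in order)."""
    if chain is None:
        return None, []
    if isinstance(chain, str):
        # A single string names the impersonated account directly.
        return chain, []
    accounts = list(chain)
    if not accounts:
        return None, []
    # The last entry is impersonated; earlier entries are traversed in order.
    return accounts[-1], accounts[:-1]


print(split_impersonation_chain("deployer@example.iam.gserviceaccount.com"))
print(
    split_impersonation_chain(
        [
            "hop-1@example.iam.gserviceaccount.com",
            "hop-2@example.iam.gserviceaccount.com",
            "deployer@example.iam.gserviceaccount.com",
        ]
    )
)
```

In the sequence case, each identity must have been granted the Service Account Token Creator IAM role by the next one in the list, as the parameter description states.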
 
 
- class airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator(*, source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, delete_extra_files=False, allow_overwrite=False, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator
- Synchronizes the contents of the buckets or bucket’s directories in the Google Cloud Services.
- Parameters source_object and destination_object describe the root sync directory. If they are not passed, the entire bucket will be synchronized. They should point to directories.
- Note: The synchronization of individual files is not supported. Only entire directories can be synchronized.
- See also: For more information on how to use this operator, take a look at the guide: GCSSynchronizeBuckets
- Parameters
- source_bucket (str) – The name of the bucket containing the source objects. 
- destination_bucket (str) – The name of the bucket containing the destination objects. 
- source_object (str | None) – The root sync directory in the source bucket. 
- destination_object (str | None) – The root sync directory in the destination bucket. 
- recursive (bool) – If True, subdirectories will be considered. 
- allow_overwrite (bool) – If True, files will be overwritten when a mismatched file is found. By default, overwriting files is not allowed. 
- delete_extra_files (bool) – If True, deletes extra files in the destination that are not found in the source. By default, extra files are not deleted. Note: This option can delete data quickly if you specify the wrong source/destination combination. 
- gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud. 
- impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated). 
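The interaction of `allow_overwrite` and `delete_extra_files` can be illustrated with a small planning function over two name-to-checksum mappings. This is a sketch under stated assumptions, not the provider's implementation: `plan_sync` and the checksum strings are hypothetical, names are taken relative to the sync root, a "mismatched file" is assumed to mean differing checksums, and extra files are assumed to be those present only in the destination.

```python
from typing import Dict, List, Tuple


def plan_sync(
    source: Dict[str, str],
    destination: Dict[str, str],
    allow_overwrite: bool = False,
    delete_extra_files: bool = False,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (names to copy, names to overwrite, names to delete)."""
    source_names, destination_names = set(source), set(destination)
    # Present only in the source: always copied.
    to_copy = sorted(source_names - destination_names)
    # Present on both sides with different checksums: overwritten only if allowed.
    mismatched = sorted(
        name
        for name in source_names & destination_names
        if source[name] != destination[name]
    )
    to_overwrite = mismatched if allow_overwrite else []
    # Present only in the destination: deleted only if requested.
    extra = sorted(destination_names - source_names)
    to_delete = extra if delete_extra_files else []
    return to_copy, to_overwrite, to_delete


source = {"a.csv": "c1", "b.csv": "c2", "c.csv": "c3"}
destination = {"b.csv": "c2", "c.csv": "XX", "old.csv": "c9"}

print(plan_sync(source, destination))
print(plan_sync(source, destination, allow_overwrite=True, delete_extra_files=True))
```

With the defaults, only the missing file is copied; enabling both flags also rewrites the mismatched file and removes the one that exists only in the destination, which is why a wrong source/destination pairing can delete data quickly.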
 
 
