airflow.providers.google.cloud.hooks.gcs
This module contains a Google Cloud Storage hook.
Module Contents
Classes
| GCSHook | Use the Google Cloud connection to interact with Google Cloud Storage. |
| GCSAsyncHook | GCSAsyncHook runs on the trigger worker; inherits from GoogleBaseAsyncHook. |
Functions
| gcs_object_is_directory(bucket) | Return True if given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or empty bucket. |
| parse_json_from_gcs(gcp_conn_id, file_uri[, impersonation_chain]) | Download and parse a JSON file from Google Cloud Storage. |
Attributes
- class airflow.providers.google.cloud.hooks.gcs.GCSHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
- Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook
- Use the Google Cloud connection to interact with Google Cloud Storage.
 - copy(source_bucket, source_object, destination_bucket=None, destination_object=None)
- Copy an object from one bucket to another, renaming it if requested.
- Either destination_bucket or destination_object can be omitted, in which case the corresponding source value is used, but they cannot both be omitted.
- Parameters
- source_bucket (str) – The bucket of the object to copy from. 
- source_object (str) – The object to copy. 
- destination_bucket (str | None) – The bucket to copy the object to. Can be omitted; then the same bucket is used. 
- destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used. 
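A minimal sketch of a same-bucket rename-style copy, assuming `hook` is a GCSHook instance (or any object exposing the same `copy` method). The helper name `archive_object` and its arguments are hypothetical and only serve the illustration.

```python
def archive_object(hook, bucket: str, source_object: str, archived_object: str) -> None:
    """Copy an object to a new name inside the same bucket.

    `hook` is assumed to be a GCSHook; this helper is illustrative only.
    """
    if source_object == archived_object:
        # With the bucket omitted, an identical name would mean that neither
        # destination value differs from the source, which is not allowed.
        raise ValueError("archived_object must differ from source_object")
    # destination_bucket is omitted, so the source bucket is reused.
    hook.copy(
        source_bucket=bucket,
        source_object=source_object,
        destination_object=archived_object,
    )


# Usage (needs the Google provider installed and a configured connection):
#   from airflow.providers.google.cloud.hooks.gcs import GCSHook
#   hook = GCSHook(gcp_conn_id="google_cloud_default")
#   archive_object(hook, "my-bucket", "data/in.csv", "archive/in.csv")
```

The bucket and object names in the usage comment are placeholders.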
 
 
 - rewrite(source_bucket, source_object, destination_bucket, destination_object=None)
- Similar to copy; supports files over 5 TB, and copying between locations and/or storage classes.
- destination_object can be omitted, in which case source_object is used.
- Parameters
 
 - download(bucket_name: str, object_name: str, filename: None = None, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) -> bytes
 - download(bucket_name: str, object_name: str, filename: str, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) -> str
- Download a file from Google Cloud Storage.
- When no filename is supplied, the method loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns that location. For files larger than the available memory, writing to a file is recommended.
- Parameters
- bucket_name – The bucket to fetch from. 
- object_name – The object to fetch. 
- filename – If set, a local file path where the file should be written to. 
- chunk_size – Blob chunk size. 
- timeout – Request timeout in seconds. 
- num_max_attempts – Number of attempts to download the file. 
- user_project – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets. 
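The two return modes described above can be sketched as follows. This assumes `hook` is a GCSHook instance; the helper names `read_text` and `save_to_disk` are hypothetical.

```python
def read_text(hook, bucket: str, object_name: str, encoding: str = "utf-8") -> str:
    """Load a small object into memory and decode it."""
    # No filename: download() returns the object's content as bytes.
    content = hook.download(bucket_name=bucket, object_name=object_name)
    return content.decode(encoding)


def save_to_disk(hook, bucket: str, object_name: str, local_path: str) -> str:
    """Write an object to local_path; preferable for large objects."""
    # With filename set, download() returns the location it wrote to.
    return hook.download(
        bucket_name=bucket, object_name=object_name, filename=local_path
    )
```

Choosing `save_to_disk` for objects that may not fit in memory follows the recommendation in the description.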
 
 
 - download_as_byte_array(bucket_name, object_name, chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1)
- Download a file from Google Cloud Storage.
- When no filename is supplied, the operator loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns the location. For file sizes that exceed the available memory it is recommended to write to a file. 
 - provide_file(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, dir=None, user_project=None)
- Download the file to a temporary directory and return a file handle.
- You can use this method by passing the bucket_name and object_name parameters, or just the object_url parameter.
- Parameters
- bucket_name (str) – The bucket to fetch from. 
- object_name (str | None) – The object to fetch. 
- object_url (str | None) – File reference URL. Must start with “gs://”. 
- dir (str | None) – The temporary subdirectory to download the file to (passed to NamedTemporaryFile). 
- user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets. 
 
- Returns
- File handle 
- Return type
- Generator[IO[bytes], None, None] 
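Because provide_file yields a file handle, it is meant to be used in a with-block. A minimal sketch, assuming `hook` is a GCSHook instance; the helper name `count_lines` is hypothetical.

```python
def count_lines(hook, bucket: str, object_name: str) -> int:
    """Count the lines of an object by streaming it from a temporary file."""
    # The temporary file is only valid inside the with-block.
    with hook.provide_file(bucket_name=bucket, object_name=object_name) as handle:
        # Defensive rewind; harmless if the handle is already at the start.
        handle.seek(0)
        # The handle is binary, so each item is a bytes line.
        return sum(1 for _ in handle)
```

The same call could pass `object_url="gs://..."` instead of the bucket and object names.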
 
 - provide_file_and_upload(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, user_project=None)
- Create a temporary file, return a file handle, and upload the file's content on close.
- You can use this method by passing the bucket_name and object_name parameters, or just the object_url parameter.
- Parameters
- Returns
- File handle 
- Return type
- Generator[IO[bytes], None, None] 
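A short sketch of the write-then-upload pattern, assuming `hook` is a GCSHook instance; the helper name `write_lines` is hypothetical.

```python
def write_lines(hook, bucket: str, object_name: str, lines) -> None:
    """Write text lines to a temporary file that is uploaded on close."""
    with hook.provide_file_and_upload(
        bucket_name=bucket, object_name=object_name
    ) as handle:
        for line in lines:
            # The handle is binary (IO[bytes]), so encode before writing.
            handle.write((line + "\n").encode("utf-8"))
    # Leaving the with-block closes the file, which triggers the upload.
```

No explicit upload call is needed in the helper, since the upload happens when the context exits.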
 
 - upload(bucket_name, object_name, filename=None, data=None, mime_type=None, gzip=False, encoding='utf-8', chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1, metadata=None, cache_control=None, user_project=None)
- Upload a local file or file data as string or bytes to Google Cloud Storage.
- Parameters
- bucket_name (str) – The bucket to upload to. 
- object_name (str) – The object name to set when uploading the file. 
- filename (str | None) – The local file path to the file to be uploaded. 
- data (str | bytes | None) – The file’s data as a string or bytes to be uploaded. 
- mime_type (str | None) – The file’s mime type set when uploading the file. 
- gzip (bool) – Option to compress the local file or file data for upload. 
- encoding (str) – Bytes encoding for file data if provided as a string. 
- chunk_size (int | None) – Blob chunk size. 
- timeout (int | None) – Request timeout in seconds. 
- num_max_attempts (int) – Number of attempts to try to upload the file. 
- metadata (dict | None) – The metadata to be uploaded with the file. 
- cache_control (str | None) – Cache-Control metadata field. 
- user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets. 
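As an illustration of uploading in-memory data rather than a local file, the sketch below serializes a dict to JSON. It assumes `hook` is a GCSHook instance; the helper name `upload_json` is hypothetical.

```python
import json


def upload_json(hook, bucket: str, object_name: str, payload: dict,
                compress: bool = False) -> None:
    """Serialize payload and upload it as string data."""
    hook.upload(
        bucket_name=bucket,
        object_name=object_name,
        # String data is encoded using `encoding` ('utf-8' by default).
        data=json.dumps(payload),
        mime_type="application/json",
        gzip=compress,
    )
```

Only `data` is passed here; `filename` is left unset because the content does not come from a local file.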
 
 
 - exists(bucket_name, object_name, retry=DEFAULT_RETRY)
- Check for the existence of a file in Google Cloud Storage.
- Parameters
- bucket_name (str) – The Google Cloud Storage bucket where the object is. 
- object_name (str) – The name of the object to check in the Google Cloud Storage bucket. 
- retry (google.api_core.retry.Retry) – (Optional) How to retry the RPC 
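A small guard built on exists, assuming `hook` is a GCSHook instance; the helper name `require_object` and the choice of FileNotFoundError are illustrative assumptions.

```python
def require_object(hook, bucket: str, object_name: str) -> None:
    """Raise if the object is missing, so later steps can assume it exists."""
    if not hook.exists(bucket_name=bucket, object_name=object_name):
        raise FileNotFoundError(f"gs://{bucket}/{object_name} does not exist")
```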
 
 
 - get_blob_update_time(bucket_name, object_name)
- Get the update time of a file in Google Cloud Storage. 
 - is_updated_after(bucket_name, object_name, ts)
- Check if an object was updated after the given time in Google Cloud Storage.
- Parameters
- bucket_name (str) – The Google Cloud Storage bucket where the object is. 
- object_name (str) – The name of the object to check in the Google Cloud Storage bucket. 
- ts (datetime.datetime) – The timestamp to check against. 
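A freshness check can be sketched on top of is_updated_after. This assumes `hook` is a GCSHook instance; the helper name `updated_within` is hypothetical, and using a timezone-aware UTC timestamp is a choice made here to avoid ambiguity.

```python
from datetime import datetime, timedelta, timezone


def updated_within(hook, bucket: str, object_name: str, max_age: timedelta) -> bool:
    """Return True if the object was updated less than max_age ago."""
    # Build an explicit UTC cutoff rather than a naive local timestamp.
    cutoff = datetime.now(timezone.utc) - max_age
    return hook.is_updated_after(
        bucket_name=bucket, object_name=object_name, ts=cutoff
    )
```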
 
 
 - is_updated_between(bucket_name, object_name, min_ts, max_ts)
- Check if an object was updated between the given minimum and maximum timestamps in Google Cloud Storage.
- Parameters
- bucket_name (str) – The Google Cloud Storage bucket where the object is. 
- object_name (str) – The name of the object to check in the Google Cloud Storage bucket. 
- min_ts (datetime.datetime) – The minimum timestamp to check against. 
- max_ts (datetime.datetime) – The maximum timestamp to check against. 
 
 
 - is_updated_before(bucket_name, object_name, ts)
- Check if an object was updated before the given time in Google Cloud Storage.
- Parameters
- bucket_name (str) – The Google Cloud Storage bucket where the object is. 
- object_name (str) – The name of the object to check in the Google Cloud Storage bucket. 
- ts (datetime.datetime) – The timestamp to check against. 
 
 
 - delete_bucket(bucket_name, force=False, user_project=None)
- Delete a bucket from Google Cloud Storage.
- Parameters
- bucket_name (str) – Name of the bucket to delete. 
- force (bool) – If False, a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket. 
- user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets. 
 
 
 - list(bucket_name, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None, user_project=None)
- List all objects from the bucket with the given single prefix or multiple prefixes.
- Parameters
- bucket_name (str) – bucket name 
- versions (bool | None) – if true, list all versions of the objects 
- max_results (int | None) – max count of items to return in a single page of responses 
- prefix (str | List[str] | None) – string or list of strings that filters objects whose names begin with it/them 
- delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’) 
- match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json'). 
- user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets. 
 
- Returns
- a stream of object names matching the filtering criteria 
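Since prefix accepts either one string or a list of strings, several "directories" can be listed in one call. A minimal sketch, assuming `hook` is a GCSHook instance; the helper name `list_names` is hypothetical.

```python
from typing import List, Optional


def list_names(hook, bucket: str, prefixes: List[str],
               glob: Optional[str] = None) -> List[str]:
    """Return the sorted object names found under any of the given prefixes."""
    names = hook.list(bucket_name=bucket, prefix=prefixes, match_glob=glob)
    # Sorting gives a stable order regardless of how results are returned.
    return sorted(names)
```

The deprecated delimiter parameter is deliberately not used; match_glob covers suffix-style filtering.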
 
 - list_by_timespan(bucket_name, timespan_start, timespan_end, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None)
- List all objects from the bucket with the given string prefix that were updated in the time range.
- Parameters
- bucket_name (str) – bucket name 
- timespan_start (datetime.datetime) – will return objects that were updated at or after this datetime (UTC) 
- timespan_end (datetime.datetime) – will return objects that were updated before this datetime (UTC) 
- versions (bool | None) – if true, list all versions of the objects 
- max_results (int | None) – max count of items to return in a single page of responses 
- prefix (str | None) – prefix string that filters objects whose names begin with this prefix 
- delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’) 
- match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json'). 
 
- Returns
- a stream of object names matching the filtering criteria 
- Return type
- List[str] 
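The half-open range (start inclusive, end exclusive) maps naturally onto calendar days. A sketch, assuming `hook` is a GCSHook instance; the helper name `objects_updated_on` is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List, Optional


def objects_updated_on(hook, bucket: str, day: date,
                       prefix: Optional[str] = None) -> List[str]:
    """List objects updated during the given UTC calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    # End is exclusive, so midnight of the next day covers the whole day.
    end = start + timedelta(days=1)
    return hook.list_by_timespan(
        bucket_name=bucket,
        timespan_start=start,
        timespan_end=end,
        prefix=prefix,
    )
```

Consecutive days produce adjacent windows with no overlap, so an object is attributed to exactly one day.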
 
 - get_crc32c(bucket_name, object_name)
- Get the CRC32c checksum of an object in Google Cloud Storage. 
 - get_md5hash(bucket_name, object_name)
- Get the MD5 hash of an object in Google Cloud Storage. 
 - create_bucket(bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=PROVIDE_PROJECT_ID, labels=None)
- Create a new bucket.
- Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
- See also: For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
- Parameters
- bucket_name (str) – The name of the bucket. 
- resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert 
- storage_class (str) – This defines how objects in the bucket are stored and determines the SLA and the cost of storage. Values include MULTI_REGIONAL, REGIONAL, STANDARD, NEARLINE, and COLDLINE. If this value is not specified when the bucket is created, it will default to STANDARD. 
- location (str) – The location of the bucket. Object data for objects in the bucket resides in physical storage within this region. Defaults to US. 
- project_id (str) – The ID of the Google Cloud Project. 
- labels (dict | None) – User-provided labels, in key/value pairs. 
 
- Returns
- If successful, it returns the id of the bucket.
- Return type
 
 - insert_bucket_acl(bucket_name, entity, role, user_project=None)
- Create a new ACL entry on the specified bucket.
- See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert
- Parameters
- bucket_name (str) – Name of the bucket. 
- entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes 
- role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”. 
- user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets. 
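The entity string combines a type prefix with an identifier. A sketch of granting read access using the user-email form, assuming `hook` is a GCSHook instance; the helper name `grant_bucket_reader` is hypothetical.

```python
def grant_bucket_reader(hook, bucket: str, email: str) -> None:
    """Add a READER ACL entry on the bucket for a single user."""
    hook.insert_bucket_acl(
        bucket_name=bucket,
        # "user-<email>" is the user-email form from the list of entity forms.
        entity=f"user-{email}",
        role="READER",
    )
```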
 
 
 - insert_object_acl(bucket_name, object_name, entity, role, generation=None, user_project=None)
- Create a new ACL entry on the specified object.
- See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert
- Parameters
- bucket_name (str) – Name of the bucket. 
- object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding 
- entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes 
- role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”. 
- generation (int | None) – Optional. If present, selects a specific revision of this object. 
- user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets. 
 
 
 - compose(bucket_name, source_objects, destination_object)
- Compose a list of existing objects into a new object in the same bucket.
- Currently, at most 32 objects can be concatenated in a single operation.
- See: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
- Parameters
- bucket_name (str) – The name of the bucket containing the source objects. This is also the same bucket to store the composed destination object. 
- source_objects (List[str]) – The list of source objects that will be composed into a single object. 
- destination_object (str) – The path of the object if given. 
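Given the 32-object limit per operation, it can help to validate the input before calling compose. A sketch, assuming `hook` is a GCSHook instance; the helper name `compose_parts` and the constant `MAX_COMPOSE_SOURCES` are hypothetical.

```python
MAX_COMPOSE_SOURCES = 32  # per-operation limit from the description


def compose_parts(hook, bucket: str, parts, destination_object: str) -> None:
    """Concatenate parts (in order) into destination_object in the same bucket."""
    parts = list(parts)
    if not parts:
        raise ValueError("at least one source object is required")
    if len(parts) > MAX_COMPOSE_SOURCES:
        raise ValueError(
            f"compose accepts at most {MAX_COMPOSE_SOURCES} objects, got {len(parts)}"
        )
    hook.compose(
        bucket_name=bucket,
        source_objects=parts,
        destination_object=destination_object,
    )
```

Failing early on the client side gives a clearer error than a rejected request.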
 
 
 - sync(source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, allow_overwrite=False, delete_extra_files=False)
- Synchronize the contents of the buckets.
- Parameters source_object and destination_object describe the root sync directories. If they are not passed, the entire bucket will be synchronized. If they are passed, they should point to directories.
- Note: The synchronization of individual files is not supported. Only entire directories can be synchronized.
- Parameters
- source_bucket (str) – The name of the bucket containing the source objects. 
- destination_bucket (str) – The name of the bucket containing the destination objects. 
- source_object (str | None) – The root sync directory in the source bucket. 
- destination_object (str | None) – The root sync directory in the destination bucket. 
- recursive (bool) – If True, subdirectories will be considered. 
- allow_overwrite (bool) – If True, files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed. 
- delete_extra_files (bool) – If True, deletes files in the destination that are not found in the source. By default, extra files are not deleted. Note: This option can delete data quickly if you specify the wrong source/destination combination. 
 
- Returns
- none 
- Return type
- None 
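A sketch of mirroring one directory between two buckets, assuming `hook` is a GCSHook instance; the helper name `mirror_directory` and its `prune` flag are hypothetical. Deletion stays off unless requested, in line with the note on delete_extra_files.

```python
def mirror_directory(hook, source_bucket: str, destination_bucket: str,
                     directory: str, prune: bool = False) -> None:
    """Sync the same directory path from source_bucket to destination_bucket."""
    hook.sync(
        source_bucket=source_bucket,
        destination_bucket=destination_bucket,
        # Both roots point to a directory; single files are not supported.
        source_object=directory,
        destination_object=directory,
        recursive=True,
        allow_overwrite=True,
        # Off by default because a wrong source/destination pair can delete data.
        delete_extra_files=prune,
    )
```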
 
 
- airflow.providers.google.cloud.hooks.gcs.gcs_object_is_directory(bucket)
- Return True if given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or empty bucket. 
- airflow.providers.google.cloud.hooks.gcs.parse_json_from_gcs(gcp_conn_id, file_uri, impersonation_chain=None)
- Download and parse a JSON file from Google Cloud Storage. 
- class airflow.providers.google.cloud.hooks.gcs.GCSAsyncHook(**kwargs)
- Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook
- GCSAsyncHook runs on the trigger worker; inherits from GoogleBaseAsyncHook.