airflow.providers.google.cloud.hooks.gcs

This module contains a Google Cloud Storage hook.

Module Contents

Classes

GCSHook

Interact with Google Cloud Storage. This hook uses the Google Cloud connection.

GCSAsyncHook

GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.

Functions

gcs_object_is_directory(bucket)

Return True if the given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or an empty bucket.

parse_json_from_gcs(gcp_conn_id, file_uri)

Downloads and parses a JSON file from Google Cloud Storage.

Attributes

RT

T

FParams

List

DEFAULT_TIMEOUT

PROVIDE_BUCKET

airflow.providers.google.cloud.hooks.gcs.RT[source]
airflow.providers.google.cloud.hooks.gcs.T[source]
airflow.providers.google.cloud.hooks.gcs.FParams[source]
airflow.providers.google.cloud.hooks.gcs.List[source]
airflow.providers.google.cloud.hooks.gcs.DEFAULT_TIMEOUT = 60[source]
airflow.providers.google.cloud.hooks.gcs.PROVIDE_BUCKET: str[source]
class airflow.providers.google.cloud.hooks.gcs.GCSHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Interact with Google Cloud Storage. This hook uses the Google Cloud connection.

get_conn()[source]

Returns a Google Cloud Storage service object.

copy(source_bucket, source_object, destination_bucket=None, destination_object=None)[source]

Copies an object from a bucket to another, with renaming if requested.

Either destination_bucket or destination_object can be omitted, in which case the source bucket or source object is used, but they cannot both be omitted.

Parameters
  • source_bucket (str) – The bucket of the object to copy from.

  • source_object (str) – The object to copy.

  • destination_bucket (str | None) – The bucket to copy the object to. Can be omitted; then the same bucket is used.

  • destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
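The defaulting rule described above can be sketched in plain Python. The helper name resolve_copy_destination is hypothetical and is not part of the hook; it only mirrors the documented rule that either destination may be omitted, but not both.

```python
from __future__ import annotations


def resolve_copy_destination(
    source_bucket: str,
    source_object: str,
    destination_bucket: str | None = None,
    destination_object: str | None = None,
) -> tuple[str, str]:
    """Hypothetical helper: apply the documented defaults for copy()."""
    if destination_bucket is None and destination_object is None:
        # The documentation states that both cannot be omitted at once.
        raise ValueError("Provide destination_bucket, destination_object, or both.")
    # An omitted destination falls back to the corresponding source value.
    return (
        destination_bucket or source_bucket,
        destination_object or source_object,
    )
```

With only destination_object given, the object is renamed inside the source bucket; with only destination_bucket given, it keeps its name in the new bucket.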

rewrite(source_bucket, source_object, destination_bucket, destination_object=None)[source]

Has the same functionality as copy, except that it also works on files over 5 TB, as well as when copying between locations and/or storage classes.

destination_object can be omitted, in which case source_object is used.

Parameters
  • source_bucket (str) – The bucket of the object to copy from.

  • source_object (str) – The object to copy.

  • destination_bucket (str) – The bucket to copy the object to.

  • destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.

download(bucket_name: str, object_name: str, filename: None = None, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1) bytes[source]
download(bucket_name: str, object_name: str, filename: str, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1) str

Downloads a file from Google Cloud Storage.

When no filename is supplied, this method loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns the location. For files larger than the available memory, writing to a file is recommended.

Parameters
  • bucket_name – The bucket to fetch from.

  • object_name – The object to fetch.

  • filename – If set, a local file path where the file should be written to.

  • chunk_size – Blob chunk size.

  • timeout – Request timeout in seconds.

  • num_max_attempts – Number of attempts to download the file.
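The two return modes can be illustrated with a small sketch. FakeGCSHook is a hypothetical in-memory stand-in so the example runs without credentials; it only imitates the documented behaviour of download (bytes when no filename is given, the local path otherwise). The helpers read_text and save_locally are likewise illustrative names, not part of the module.

```python
from __future__ import annotations


class FakeGCSHook:
    """Hypothetical in-memory stand-in for GCSHook so the sketch runs offline."""

    def __init__(self, blobs: dict[tuple[str, str], bytes]) -> None:
        self._blobs = blobs

    def download(self, bucket_name: str, object_name: str, filename: str | None = None):
        data = self._blobs[(bucket_name, object_name)]
        if filename is None:
            return data  # no filename: the content is returned as bytes
        with open(filename, "wb") as handle:
            handle.write(data)
        return filename  # filename given: the local path is returned


def read_text(hook, bucket_name: str, object_name: str, encoding: str = "utf-8") -> str:
    """Load a small object into memory and decode it."""
    return hook.download(bucket_name=bucket_name, object_name=object_name).decode(encoding)


def save_locally(hook, bucket_name: str, object_name: str, filename: str) -> str:
    """Write a (possibly large) object to disk and return the local path."""
    return hook.download(bucket_name=bucket_name, object_name=object_name, filename=filename)
```

In a real task the same two helpers would presumably be called with a GCSHook instance in place of the stand-in.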

download_as_byte_array(bucket_name, object_name, chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1)[source]

Downloads a file from Google Cloud Storage.

This method loads the file into memory and returns its content. It takes no filename parameter; to write the object to a local file, use download with a filename instead. For files larger than the available memory, writing to a file is recommended.

Parameters
  • bucket_name (str) – The bucket to fetch from.

  • object_name (str) – The object to fetch.

  • chunk_size (int | None) – Blob chunk size.

  • timeout (int | None) – Request timeout in seconds.

  • num_max_attempts (int | None) – Number of attempts to download the file.

provide_file(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, dir=None)[source]

Downloads the file to a temporary directory and returns a file handle.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters
  • bucket_name (str) – The bucket to fetch from.

  • object_name (str | None) – The object to fetch.

  • object_url (str | None) – File reference URL. Must start with “gs://”.

  • dir (str | None) – The temporary subdirectory to download the file to (passed to NamedTemporaryFile).

Returns

File handle

Return type

Generator[IO[bytes], None, None]
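The object_url form carries the same information as the bucket_name and object_name pair. A minimal sketch of that correspondence follows; split_gcs_url is a hypothetical helper written for illustration and is not claimed to be how the hook parses URLs.

```python
from urllib.parse import urlsplit


def split_gcs_url(object_url: str) -> tuple:
    """Hypothetical helper: split gs://<bucket>/<object> into (bucket, object)."""
    parts = urlsplit(object_url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"Expected a URL starting with gs://, got {object_url!r}")
    # Caveat of this sketch: urlsplit treats '?' and '#' as delimiters, so
    # object names containing those characters are not handled here.
    return parts.netloc, parts.path.lstrip("/")
```

For example, gs://test-bucket/dir1/dir2/file corresponds to bucket test-bucket and object dir1/dir2/file.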

provide_file_and_upload(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None)[source]

Creates a temporary file, returns a file handle, and uploads the file's content on close.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters
  • bucket_name (str) – The bucket to upload to.

  • object_name (str | None) – The object name to upload to.

  • object_url (str | None) – File reference URL. Must start with “gs://”.

Returns

File handle

Return type

Generator[IO[bytes], None, None]

upload(bucket_name, object_name, filename=None, data=None, mime_type=None, gzip=False, encoding='utf-8', chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1, metadata=None, cache_control=None)[source]

Uploads a local file or file data as string or bytes to Google Cloud Storage.

Parameters
  • bucket_name (str) – The bucket to upload to.

  • object_name (str) – The object name to set when uploading the file.

  • filename (str | None) – The local file path to the file to be uploaded.

  • data (str | bytes | None) – The file’s data as a string or bytes to be uploaded.

  • mime_type (str | None) – The file’s mime type set when uploading the file.

  • gzip (bool) – Option to compress local file or file data for upload

  • encoding (str) – bytes encoding for file data if provided as string

  • chunk_size (int | None) – Blob chunk size.

  • timeout (int | None) – Request timeout in seconds.

  • num_max_attempts (int) – Number of attempts to try to upload the file.

  • metadata (dict | None) – The metadata to be uploaded with the file.

  • cache_control (str | None) – Cache-Control metadata field.
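The interaction of data, encoding, and gzip can be pictured with the sketch below. prepare_payload is a hypothetical helper that shows what the parameter descriptions imply (string data is encoded to bytes, and the payload is optionally compressed); it is not the hook's implementation.

```python
from __future__ import annotations

import gzip as gzip_module


def prepare_payload(data: str | bytes, encoding: str = "utf-8", gzip: bool = False) -> bytes:
    """Hypothetical helper: turn upload data into the bytes that would be sent."""
    if isinstance(data, str):
        # String data is converted to bytes using the given encoding.
        data = data.encode(encoding)
    if gzip:
        # Optional compression before upload.
        data = gzip_module.compress(data)
    return data
```

Bytes input is passed through unchanged unless gzip is set, so encoding only matters for string data.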

exists(bucket_name, object_name, retry=DEFAULT_RETRY)[source]

Checks for the existence of a file in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • retry (google.api_core.retry.Retry) – (Optional) How to retry the RPC

get_blob_update_time(bucket_name, object_name)[source]

Get the update time of a file in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object whose update time is retrieved from the Google Cloud Storage bucket.

is_updated_after(bucket_name, object_name, ts)[source]

Checks if an object was updated after the given time in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • ts (datetime.datetime) – The timestamp to check against.

is_updated_between(bucket_name, object_name, min_ts, max_ts)[source]

Checks if an object was updated between the given timestamps in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • min_ts (datetime.datetime) – The minimum timestamp to check against.

  • max_ts (datetime.datetime) – The maximum timestamp to check against.

is_updated_before(bucket_name, object_name, ts)[source]

Checks if an object was updated before the given time in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • ts (datetime.datetime) – The timestamp to check against.

is_older_than(bucket_name, object_name, seconds)[source]

Checks if an object is older than the given number of seconds.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • seconds (int) – The time in seconds to check against
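An age check of this kind boils down to comparing the object's update time with a cutoff. The sketch below is a hypothetical helper, not the hook's code; in particular, treating naive timestamps as UTC and using a strict comparison are assumptions made for the example.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def older_than(updated: datetime, seconds: int, now: datetime | None = None) -> bool:
    """Hypothetical helper: True if `updated` lies more than `seconds` before `now`."""
    if updated.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        updated = updated.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # The cutoff is `seconds` before the current time.
    return updated < now - timedelta(seconds=seconds)
```

Passing now explicitly keeps the function deterministic, which is convenient in tests.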

delete(bucket_name, object_name)[source]

Deletes an object from the bucket.

Parameters
  • bucket_name (str) – name of the bucket, where the object resides

  • object_name (str) – name of the object to delete

delete_bucket(bucket_name, force=False)[source]

Deletes a bucket from Google Cloud Storage.

Parameters
  • bucket_name (str) – name of the bucket which will be deleted

  • force (bool) – If False, a non-empty bucket cannot be deleted. Set force=True to allow deleting a non-empty bucket.

list(bucket_name, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None)[source]

Lists all objects in the bucket that match the given prefix or prefixes.

Parameters
  • bucket_name (str) – bucket name

  • versions (bool | None) – if true, list all versions of the objects

  • max_results (int | None) – max count of items to return in a single page of responses

  • prefix (str | List[str] | None) – string or list of strings; only objects whose names begin with one of them are returned

  • delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’)

  • match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json').

Returns

a stream of object names matching the filtering criteria
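The prefix parameter accepts either one string or several. The following sketch shows the equivalent filter applied to a local list of names; filter_by_prefix is a hypothetical helper for illustration only, since the hook applies its filtering when querying the bucket.

```python
def filter_by_prefix(names, prefix=None):
    """Hypothetical helper: keep names that start with the prefix or any of the prefixes."""
    if prefix is None:
        return list(names)
    # Normalise a single string to a one-element tuple; str.startswith accepts a tuple.
    prefixes = (prefix,) if isinstance(prefix, str) else tuple(prefix)
    return [name for name in names if name.startswith(prefixes)]
```

A list of prefixes behaves like a logical OR: a name is kept when it matches at least one of them.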

list_by_timespan(bucket_name, timespan_start, timespan_end, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None)[source]

Lists all objects in the bucket whose names start with the given prefix and that were updated between timespan_start and timespan_end.

Parameters
  • bucket_name (str) – bucket name

  • timespan_start (datetime.datetime) – will return objects that were updated at or after this datetime (UTC)

  • timespan_end (datetime.datetime) – will return objects that were updated before this datetime (UTC)

  • versions (bool | None) – if true, list all versions of the objects

  • max_results (int | None) – max count of items to return in a single page of responses

  • prefix (str | None) – prefix string; only objects whose names begin with this prefix are returned

  • delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’)

  • match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json').

Returns

a stream of object names matching the filtering criteria

Return type

List[str]
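The parameter descriptions above define a half-open interval: timespan_start is inclusive and timespan_end is exclusive. A minimal sketch of that rule, using a hypothetical helper over a local mapping of object names to update times:

```python
from datetime import datetime, timezone


def names_in_timespan(updated_by_name, timespan_start, timespan_end):
    """Hypothetical helper: names updated at or after start and strictly before end."""
    return sorted(
        name
        for name, updated in updated_by_name.items()
        if timespan_start <= updated < timespan_end
    )
```

Because the end is exclusive, consecutive timespans can be processed back to back without listing the same object twice.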

get_size(bucket_name, object_name)[source]

Gets the size of a file in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_crc32c(bucket_name, object_name)[source]

Gets the CRC32c checksum of an object in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_md5hash(bucket_name, object_name)[source]

Gets the MD5 hash of an object in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
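A common use of the MD5 hash is to verify a local copy against the stored object. The sketch below assumes the value returned by get_md5hash is the base64-encoded digest that Cloud Storage reports for an object; local_md5_base64 is a hypothetical helper that computes the same representation for local bytes.

```python
import base64
import hashlib


def local_md5_base64(data: bytes) -> str:
    """Hypothetical helper: MD5 digest of local bytes, base64-encoded."""
    # Encode the raw 16-byte digest, not its hexadecimal form.
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
```

Comparing this string with the hash reported for the object indicates whether the local bytes match the stored content.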

create_bucket(bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)[source]

Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.

See also

For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
  • bucket_name (str) – The name of the bucket.

  • resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert

  • storage_class (str) –

    This defines how objects in the bucket are stored and determines the SLA and the cost of storage. Values include

    • MULTI_REGIONAL

    • REGIONAL

    • STANDARD

    • NEARLINE

    • COLDLINE.

    If this value is not specified when the bucket is created, it will default to STANDARD.

  • location (str) –

    The location of the bucket. Object data for objects in the bucket resides in physical storage within this region. Defaults to US.

  • project_id (str | None) – The ID of the Google Cloud Project.

  • labels (dict | None) – User-provided labels, in key/value pairs.

Returns

If successful, it returns the id of the bucket.

Return type

str

insert_bucket_acl(bucket_name, entity, role, user_project=None)[source]

Creates a new ACL entry on the specified bucket.

See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert

Parameters
  • bucket_name (str) – Name of the bucket.

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”.

  • user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.

insert_object_acl(bucket_name, object_name, entity, role, generation=None, user_project=None)[source]

Creates a new ACL entry on the specified object.

See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert

Parameters
  • bucket_name (str) – Name of the bucket.

  • object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”.

  • generation (int | None) – Optional. If present, selects a specific revision of this object.

  • user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.

compose(bucket_name, source_objects, destination_object)[source]

Composes a list of existing objects into a new object in the same storage bucket.

Currently, at most 32 objects can be concatenated in a single operation.

See: https://cloud.google.com/storage/docs/json_api/v1/objects/compose

Parameters
  • bucket_name (str) – The name of the bucket containing the source objects. This is also the same bucket to store the composed destination object.

  • source_objects (List[str]) – The list of source objects that will be composed into a single object.

  • destination_object (str) – The path of the object if given.
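Because a single operation accepts at most 32 source objects, a longer list has to be split before it is passed to compose. The sketch below only shows the batching step; compose_batches is a hypothetical helper, and combining the intermediate results into one final object would need further compose calls that are not shown here.

```python
def compose_batches(source_objects, max_per_call=32):
    """Hypothetical helper: split source objects into groups of at most max_per_call."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    # Slice the list into consecutive, order-preserving chunks.
    return [
        source_objects[start:start + max_per_call]
        for start in range(0, len(source_objects), max_per_call)
    ]
```

Order is preserved within and across batches, which matters because composition concatenates the sources in the order given.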

sync(source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, allow_overwrite=False, delete_extra_files=False)[source]

Synchronizes the contents of the buckets.

Parameters source_object and destination_object describe the root sync directories. If they are not passed, the entire bucket will be synchronized. If they are passed, they should point to directories.

Note

The synchronization of individual files is not supported. Only entire directories can be synchronized.

Parameters
  • source_bucket (str) – The name of the bucket containing the source objects.

  • destination_bucket (str) – The name of the bucket containing the destination objects.

  • source_object (str | None) – The root sync directory in the source bucket.

  • destination_object (str | None) – The root sync directory in the destination bucket.

  • recursive (bool) – If True, subdirectories will be considered

  • allow_overwrite (bool) – if True, the files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed

  • delete_extra_files (bool) –

    if True, deletes extra files in the destination that are not found in the source. By default, extra files are not deleted.

    Note

    This option can delete data quickly if you specify the wrong source/destination combination.

Returns

none

Return type

None
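The effect of allow_overwrite and delete_extra_files can be pictured as a sync plan computed from two listings. This is a hypothetical sketch, not the hook's implementation: it assumes each listing maps a relative object name to a checksum, that a "mismatched" file is one present on both sides with different checksums, and that "extra" files are those present only in the destination (as in rsync-style tools).

```python
def plan_sync(source, destination, allow_overwrite=False, delete_extra_files=False):
    """Hypothetical helper: return (to_copy, to_overwrite, to_delete) name lists."""
    source_names, destination_names = set(source), set(destination)
    # Objects missing from the destination are always copied.
    to_copy = sorted(source_names - destination_names)
    # Objects on both sides with differing checksums are overwrite candidates.
    mismatched = sorted(
        name for name in source_names & destination_names
        if source[name] != destination[name]
    )
    to_overwrite = mismatched if allow_overwrite else []
    # Assumption: extra files are those that exist only in the destination.
    to_delete = sorted(destination_names - source_names) if delete_extra_files else []
    return to_copy, to_overwrite, to_delete
```

With both flags left at their defaults, the plan only ever adds objects to the destination, which is the conservative behaviour the parameter descriptions suggest.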

airflow.providers.google.cloud.hooks.gcs.gcs_object_is_directory(bucket)[source]

Return True if given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or an empty bucket. Otherwise return False.

airflow.providers.google.cloud.hooks.gcs.parse_json_from_gcs(gcp_conn_id, file_uri)[source]

Downloads and parses a JSON file from Google Cloud Storage.

Parameters
  • gcp_conn_id (str) – Airflow Google Cloud connection ID.

  • file_uri (str) – Full path to the JSON file, for example: gs://test-bucket/dir1/dir2/file

class airflow.providers.google.cloud.hooks.gcs.GCSAsyncHook(**kwargs)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook

GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.

sync_hook_class[source]
async get_storage_client(session)[source]

Returns a Google Cloud Storage service object.
