airflow.providers.google.cloud.hooks.gcs

This module contains a Google Cloud Storage hook.

Module Contents

Classes

GCSHook

Use the Google Cloud connection to interact with Google Cloud Storage.

GCSAsyncHook

GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.

Functions

gcs_object_is_directory(bucket)

Return True if the given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or an empty bucket.

parse_json_from_gcs(gcp_conn_id, file_uri[, ...])

Download and parse a JSON file from Google Cloud Storage.

Attributes

RT

T

FParams

List

DEFAULT_TIMEOUT

PROVIDE_BUCKET

airflow.providers.google.cloud.hooks.gcs.RT[source]
airflow.providers.google.cloud.hooks.gcs.T[source]
airflow.providers.google.cloud.hooks.gcs.FParams[source]
airflow.providers.google.cloud.hooks.gcs.List[source]
airflow.providers.google.cloud.hooks.gcs.DEFAULT_TIMEOUT = 60[source]
airflow.providers.google.cloud.hooks.gcs.PROVIDE_BUCKET: str[source]
class airflow.providers.google.cloud.hooks.gcs.GCSHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Use the Google Cloud connection to interact with Google Cloud Storage.

get_conn()[source]

Return a Google Cloud Storage service object.

copy(source_bucket, source_object, destination_bucket=None, destination_object=None)[source]

Copy an object from one bucket to another, with renaming if requested.

Either destination_bucket or destination_object can be omitted, in which case the corresponding source value is used, but they cannot both be omitted.

Parameters
  • source_bucket (str) – The bucket of the object to copy from.

  • source_object (str) – The object to copy.

  • destination_bucket (str | None) – The bucket to copy the object to. Can be omitted; then the same bucket is used.

  • destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
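
The defaulting rule described for copy can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the hook's implementation: resolve_copy_target is a hypothetical helper, and raising ValueError when the target equals the source is an assumption.

```python
from __future__ import annotations


def resolve_copy_target(
    source_bucket: str,
    source_object: str,
    destination_bucket: str | None = None,
    destination_object: str | None = None,
) -> tuple[str, str]:
    """Return the (bucket, object) pair a copy would write to."""
    # An omitted destination falls back to the corresponding source value.
    target_bucket = destination_bucket or source_bucket
    target_object = destination_object or source_object
    # Omitting both would copy the object onto itself (assumed to be an error).
    if (target_bucket, target_object) == (source_bucket, source_object):
        raise ValueError("Provide destination_bucket, destination_object, or both.")
    return target_bucket, target_object
```

With a GCSHook instance, a call such as hook.copy(source_bucket="a", source_object="x.csv", destination_bucket="b") would therefore be expected to write the object to bucket "b" under the same name.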

rewrite(source_bucket, source_object, destination_bucket, destination_object=None)[source]

Similar to copy; supports files over 5 TB, and copying between locations and/or storage classes.

destination_object can be omitted, in which case source_object is used.

Parameters
  • source_bucket (str) – The bucket of the object to copy from.

  • source_object (str) – The object to copy.

  • destination_bucket (str) – The bucket to copy the object to.

  • destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.

download(bucket_name: str, object_name: str, filename: None = None, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) bytes[source]
download(bucket_name: str, object_name: str, filename: str, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) str

Download a file from Google Cloud Storage.

When no filename is supplied, this method loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns the location. For files that exceed the available memory, writing to a file is recommended.

Parameters
  • bucket_name – The bucket to fetch from.

  • object_name – The object to fetch.

  • filename – If set, a local file path where the file should be written to.

  • chunk_size – Blob chunk size.

  • timeout – Request timeout in seconds.

  • num_max_attempts – Number of attempts to download the file.

  • user_project – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
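
The two download overloads can be exercised as in the sketch below. The helper names make_hook and fetch_object are hypothetical; the import is done lazily only so that the sketch can be loaded without Airflow installed, and a configured Google Cloud connection is assumed when it is actually run.

```python
from __future__ import annotations


def make_hook(gcp_conn_id: str = "google_cloud_default"):
    """Build a GCSHook for the given Airflow connection ID."""
    # Imported lazily so this sketch can be loaded without Airflow installed.
    from airflow.providers.google.cloud.hooks.gcs import GCSHook

    return GCSHook(gcp_conn_id=gcp_conn_id)


def fetch_object(hook, bucket_name: str, object_name: str, filename: str | None = None):
    """Return bytes when filename is None, otherwise the path that was written."""
    if filename is None:
        # In-memory download: only suitable for objects that fit in memory.
        return hook.download(bucket_name=bucket_name, object_name=object_name)
    # File download: the content is written to filename and the path is returned.
    return hook.download(
        bucket_name=bucket_name, object_name=object_name, filename=filename
    )
```

A call such as fetch_object(make_hook(), "my-bucket", "data/report.csv", filename="/tmp/report.csv") would write the object to disk; the bucket and paths here are placeholders.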

download_as_byte_array(bucket_name, object_name, chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1)[source]

Download a file from Google Cloud Storage.

The file is loaded into memory and its content is returned. For files that exceed the available memory, use download with a filename instead.

Parameters
  • bucket_name (str) – The bucket to fetch from.

  • object_name (str) – The object to fetch.

  • chunk_size (int | None) – Blob chunk size.

  • timeout (int | None) – Request timeout in seconds.

  • num_max_attempts (int | None) – Number of attempts to download the file.

provide_file(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, dir=None, user_project=None)[source]

Download the file to a temporary directory and return a file handle.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters
  • bucket_name (str) – The bucket to fetch from.

  • object_name (str | None) – The object to fetch.

  • object_url (str | None) – File reference URL. Must start with “gs://”.

  • dir (str | None) – The tmp sub directory to download the file to. (passed to NamedTemporaryFile)

  • user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

Returns

File handler

Return type

Generator[IO[bytes], None, None]
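
Because provide_file yields a file handle, it is typically used in a with block. The sketch below assumes hook is a GCSHook instance; read_first_line is a hypothetical helper, and the handle should only be used inside the block.

```python
def read_first_line(hook, bucket_name: str, object_name: str) -> bytes:
    """Download an object to a temporary file and return its first line."""
    # provide_file is used as a context manager; the handle is a binary file
    # that is only expected to be valid inside the with block.
    with hook.provide_file(bucket_name=bucket_name, object_name=object_name) as handle:
        return handle.readline()
```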

provide_file_and_upload(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, user_project=None)[source]

Create a temporary file, return a file handle, and upload the file's content on close.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters
  • bucket_name (str) – The bucket to upload to.

  • object_name (str | None) – The object to upload.

  • object_url (str | None) – File reference URL. Must start with “gs://”.

  • user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

Returns

File handler

Return type

Generator[IO[bytes], None, None]
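
A minimal sketch of writing through provide_file_and_upload follows. It assumes hook is a GCSHook instance; write_json_object is a hypothetical helper. The upload is expected to happen when the with block exits.

```python
import json


def write_json_object(hook, bucket_name: str, object_name: str, payload: dict) -> None:
    """Write payload as JSON to a temporary file that is uploaded on close."""
    with hook.provide_file_and_upload(
        bucket_name=bucket_name, object_name=object_name
    ) as handle:
        # The handle is a binary file, so the JSON text is encoded first.
        handle.write(json.dumps(payload).encode("utf-8"))
```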

upload(bucket_name, object_name, filename=None, data=None, mime_type=None, gzip=False, encoding='utf-8', chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1, metadata=None, cache_control=None, user_project=None)[source]

Upload a local file or file data as string or bytes to Google Cloud Storage.

Parameters
  • bucket_name (str) – The bucket to upload to.

  • object_name (str) – The object name to set when uploading the file.

  • filename (str | None) – The local file path to the file to be uploaded.

  • data (str | bytes | None) – The file’s data as a string or bytes to be uploaded.

  • mime_type (str | None) – The file’s mime type set when uploading the file.

  • gzip (bool) – Option to compress the local file or file data for upload.

  • encoding (str) – Bytes encoding for file data if provided as a string.

  • chunk_size (int | None) – Blob chunk size.

  • timeout (int | None) – Request timeout in seconds.

  • num_max_attempts (int) – Number of attempts to try to upload the file.

  • metadata (dict | None) – The metadata to be uploaded with the file.

  • cache_control (str | None) – Cache-Control metadata field.

  • user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
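
The sketch below uploads in-memory text using the data parameter. It assumes hook is a GCSHook instance; upload_text is a hypothetical helper, and it is assumed that only one of filename and data should be passed in a single call.

```python
def upload_text(
    hook, bucket_name: str, object_name: str, text: str, compress: bool = False
) -> None:
    """Upload a string as a text object, optionally gzip-compressed."""
    hook.upload(
        bucket_name=bucket_name,
        object_name=object_name,
        data=text,  # in-memory content; filename is left unset
        mime_type="text/plain",
        gzip=compress,
        encoding="utf-8",  # used to encode the string into bytes
    )
```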

exists(bucket_name, object_name, retry=DEFAULT_RETRY)[source]

Check for the existence of a file in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the blob to check in the Google Cloud Storage bucket.

  • retry (google.api_core.retry.Retry) – (Optional) How to retry the RPC

get_blob_update_time(bucket_name, object_name)[source]

Get the update time of a file in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the blob in the Google Cloud Storage bucket whose update time is returned.

is_updated_after(bucket_name, object_name, ts)[source]

Check if an object was updated after the given time in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • ts (datetime.datetime) – The timestamp to check against.

is_updated_between(bucket_name, object_name, min_ts, max_ts)[source]

Check if an object was updated between the given times in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • min_ts (datetime.datetime) – The minimum timestamp to check against.

  • max_ts (datetime.datetime) – The maximum timestamp to check against.

is_updated_before(bucket_name, object_name, ts)[source]

Check if an object was updated before the given time in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • ts (datetime.datetime) – The timestamp to check against.

is_older_than(bucket_name, object_name, seconds)[source]

Check if object is older than given time.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

  • seconds (int) – The time in seconds to check against
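
The timestamp checks can be combined into a single freshness test, as in the sketch below. It assumes hook is a GCSHook instance; is_fresh is a hypothetical helper, and treating a naive datetime as UTC is an assumption made explicit in the code.

```python
from datetime import datetime, timezone


def is_fresh(
    hook, bucket_name: str, object_name: str, since: datetime, max_age_seconds: int
) -> bool:
    """True if the object changed after `since` and is not older than max_age_seconds."""
    if since.tzinfo is None:
        # Make naive datetimes explicit; treating them as UTC is an assumption.
        since = since.replace(tzinfo=timezone.utc)
    updated = hook.is_updated_after(bucket_name, object_name, since)
    too_old = hook.is_older_than(bucket_name, object_name, max_age_seconds)
    return updated and not too_old
```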

delete(bucket_name, object_name)[source]

Delete an object from the bucket.

Parameters
  • bucket_name (str) – name of the bucket, where the object resides

  • object_name (str) – name of the object to delete

delete_bucket(bucket_name, force=False, user_project=None)[source]

Delete a bucket from Google Cloud Storage.

Parameters
  • bucket_name (str) – name of the bucket which will be deleted

  • force (bool) – If False (the default), a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket.

  • user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

list(bucket_name, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None, user_project=None)[source]

List all objects from the bucket with the given single prefix or multiple prefixes.

Parameters
  • bucket_name (str) – bucket name

  • versions (bool | None) – if true, list all versions of the objects

  • max_results (int | None) – max count of items to return in a single page of responses

  • prefix (str | List[str] | None) – string or list of strings used to filter objects whose names begin with it/them

  • delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’)

  • match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json').

  • user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

Returns

a stream of object names matching the filtering criteria
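
Since delimiter is marked as deprecated, one conservative alternative is to list by prefix and filter the returned names on the client side. The sketch below assumes hook is a GCSHook instance and that list returns an iterable of object names; list_with_suffix is a hypothetical helper.

```python
def list_with_suffix(hook, bucket_name: str, prefix: str, suffix: str) -> list:
    """Return object names under prefix that end with suffix."""
    names = hook.list(bucket_name=bucket_name, prefix=prefix)
    # Filtering happens locally, so every object under the prefix is still listed.
    return [name for name in names if name.endswith(suffix)]
```

For large prefixes, a server-side filter such as match_glob would avoid transferring names that are discarded afterwards.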

list_by_timespan(bucket_name, timespan_start, timespan_end, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None)[source]

List all objects from the bucket with the given string prefix that were updated in the time range.

Parameters
  • bucket_name (str) – bucket name

  • timespan_start (datetime.datetime) – will return objects that were updated at or after this datetime (UTC)

  • timespan_end (datetime.datetime) – will return objects that were updated before this datetime (UTC)

  • versions (bool | None) – if true, list all versions of the objects

  • max_results (int | None) – max count of items to return in a single page of responses

  • prefix (str | None) – prefix string used to filter objects whose names begin with this prefix

  • delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’)

  • match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json').

Returns

a stream of object names matching the filtering criteria

Return type

List[str]
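
The time range is documented as inclusive at timespan_start and exclusive at timespan_end, so a one-day window can be built as in this sketch. It assumes hook is a GCSHook instance; utc_day_window and list_updated_on are hypothetical helpers.

```python
from __future__ import annotations

from datetime import date, datetime, time, timedelta, timezone


def utc_day_window(day: date) -> tuple[datetime, datetime]:
    """Return [start, end) covering the given calendar day in UTC."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def list_updated_on(hook, bucket_name: str, day: date, prefix: str | None = None):
    """List objects updated during the given UTC day."""
    start, end = utc_day_window(day)
    return hook.list_by_timespan(
        bucket_name=bucket_name,
        timespan_start=start,  # objects updated at or after this instant
        timespan_end=end,  # objects updated strictly before this instant
        prefix=prefix,
    )
```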

get_size(bucket_name, object_name)[source]

Get the size of a file in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_crc32c(bucket_name, object_name)[source]

Get the CRC32c checksum of an object in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_md5hash(bucket_name, object_name)[source]

Get the MD5 hash of an object in Google Cloud Storage.

Parameters
  • bucket_name (str) – The Google Cloud Storage bucket where the object is.

  • object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

create_bucket(bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)[source]

Create a new bucket.

Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.

See also

For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
  • bucket_name (str) – The name of the bucket.

  • resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert

  • storage_class (str) –

    This defines how objects in the bucket are stored and determines the SLA and the cost of storage. Values include

    • MULTI_REGIONAL

    • REGIONAL

    • STANDARD

    • NEARLINE

    • COLDLINE.

    If this value is not specified when the bucket is created, it will default to STANDARD.

  • location (str) –

    The location of the bucket. Object data for objects in the bucket resides in physical storage within this region. Defaults to US.

  • project_id (str | None) – The ID of the Google Cloud Project.

  • labels (dict | None) – User-provided labels, in key/value pairs.

Returns

If successful, it returns the id of the bucket.

Return type

str
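
A minimal sketch of creating a bucket follows. It assumes hook is a GCSHook instance; create_regional_bucket is a hypothetical helper. The storage class is passed explicitly so the call does not rely on any default.

```python
from __future__ import annotations


def create_regional_bucket(
    hook,
    bucket_name: str,
    location: str,
    project_id: str | None = None,
    labels: dict | None = None,
) -> str:
    """Create a REGIONAL bucket in the given location and return its id."""
    return hook.create_bucket(
        bucket_name=bucket_name,
        storage_class="REGIONAL",  # one of the documented storage classes
        location=location,
        project_id=project_id,
        labels=labels,
    )
```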

insert_bucket_acl(bucket_name, entity, role, user_project=None)[source]

Create a new ACL entry on the specified bucket.

See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert

Parameters
  • bucket_name (str) – Name of the bucket.

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”.

  • user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.

insert_object_acl(bucket_name, object_name, entity, role, generation=None, user_project=None)[source]

Create a new ACL entry on the specified object.

See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert

Parameters
  • bucket_name (str) – Name of the bucket.

  • object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”.

  • generation (int | None) – Optional. If present, selects a specific revision of this object.

  • user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
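
The entity forms listed for the ACL methods can be built with a small helper, as sketched below. Both acl_entity and grant_object_read are hypothetical names; the project-team-projectId form is left out of this sketch, and hook is assumed to be a GCSHook instance.

```python
def acl_entity(kind: str, value: str = "") -> str:
    """Build an ACL entity string such as 'user-<email>' or 'allUsers'."""
    if kind in {"allUsers", "allAuthenticatedUsers"}:
        # Special entities are passed through unchanged.
        return kind
    if kind in {"user", "group", "domain"}:
        if not value:
            raise ValueError(f"A value is required for entity kind {kind!r}.")
        return f"{kind}-{value}"
    raise ValueError(f"Unsupported entity kind: {kind!r}")


def grant_object_read(hook, bucket_name: str, object_name: str, email: str) -> None:
    """Give a single user READER access to one object."""
    hook.insert_object_acl(
        bucket_name=bucket_name,
        object_name=object_name,
        entity=acl_entity("user", email),
        role="READER",
    )
```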

compose(bucket_name, source_objects, destination_object)[source]

Compose a list of existing objects into a new object in the same storage bucket.

Currently, at most 32 objects can be concatenated in a single operation.

See: https://cloud.google.com/storage/docs/json_api/v1/objects/compose

Parameters
  • bucket_name (str) – The name of the bucket containing the source objects. This is also the same bucket to store the composed destination object.

  • source_objects (List[str]) – The list of source objects that will be composed into a single object.

  • destination_object (str) – The path of the composed destination object.
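
Because a single compose call accepts at most 32 sources, longer lists have to be composed in several passes. The sketch below is one possible approach, not part of the hook: compose_many is a hypothetical helper that reuses the destination object as the first source of each later pass, which assumes the service allows the destination to appear among the sources.

```python
def compose_many(hook, bucket_name, source_objects, destination_object, limit=32):
    """Concatenate any number of objects using repeated compose calls."""
    if limit < 2:
        raise ValueError("limit must be at least 2")
    pending = list(source_objects)
    if not pending:
        raise ValueError("source_objects must not be empty")
    # First pass: compose up to `limit` sources into the destination.
    batch, pending = pending[:limit], pending[limit:]
    hook.compose(
        bucket_name=bucket_name,
        source_objects=batch,
        destination_object=destination_object,
    )
    # Later passes: prepend the destination so new parts are appended to it.
    while pending:
        batch = [destination_object] + pending[: limit - 1]
        pending = pending[limit - 1 :]
        hook.compose(
            bucket_name=bucket_name,
            source_objects=batch,
            destination_object=destination_object,
        )
```

Each pass rewrites the destination, so for very long lists a tree-shaped composition into intermediate objects may be cheaper; that variant is not shown here.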

sync(source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, allow_overwrite=False, delete_extra_files=False)[source]

Synchronize the contents of the buckets.

Parameters source_object and destination_object describe the root sync directories. If they are not passed, the entire bucket will be synchronized. If they are passed, they should point to directories.

Note

The synchronization of individual files is not supported. Only entire directories can be synchronized.

Parameters
  • source_bucket (str) – The name of the bucket containing the source objects.

  • destination_bucket (str) – The name of the bucket containing the destination objects.

  • source_object (str | None) – The root sync directory in the source bucket.

  • destination_object (str | None) – The root sync directory in the destination bucket.

  • recursive (bool) – If True, subdirectories will be considered

  • allow_overwrite (bool) – if True, the files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed

  • delete_extra_files (bool) –

    If True, deletes files in the destination that are not found in the source. By default, extra files are not deleted.

    Note

    This option can delete data quickly if you specify the wrong source/destination combination.

Returns

none

Return type

None
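
A conservative way to call sync is sketched below: the same directory is used as the root on both sides and delete_extra_files is left disabled. It assumes hook is a GCSHook instance; sync_prefix is a hypothetical helper, and appending a trailing slash to the directory is a convention chosen here rather than a documented requirement.

```python
def sync_prefix(
    hook,
    source_bucket: str,
    destination_bucket: str,
    directory: str,
    overwrite: bool = False,
) -> None:
    """Synchronize one directory between two buckets without deleting anything."""
    # Use a trailing slash so the value clearly names a directory.
    root = directory if directory.endswith("/") else directory + "/"
    hook.sync(
        source_bucket=source_bucket,
        destination_bucket=destination_bucket,
        source_object=root,
        destination_object=root,
        recursive=True,
        allow_overwrite=overwrite,
        delete_extra_files=False,  # never remove files in this sketch
    )
```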

airflow.providers.google.cloud.hooks.gcs.gcs_object_is_directory(bucket)[source]

Return True if the given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or an empty bucket.

airflow.providers.google.cloud.hooks.gcs.parse_json_from_gcs(gcp_conn_id, file_uri, impersonation_chain=None)[source]

Download and parse a JSON file from Google Cloud Storage.

Parameters
  • gcp_conn_id (str) – Airflow Google Cloud connection ID.

  • file_uri (str) – Full path to the JSON file, for example: gs://test-bucket/dir1/dir2/file
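
The file_uri value combines the bucket and the object path in a single gs:// URI. The sketch below shows how such a URI breaks down using only the standard library; split_gcs_uri is a hypothetical helper for illustration and is not the parser used by this module.

```python
from __future__ import annotations

from urllib.parse import urlsplit


def split_gcs_uri(file_uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object path)."""
    parts = urlsplit(file_uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"Not a gs:// URI: {file_uri!r}")
    # The leading slash belongs to the URI syntax, not to the object name.
    return parts.netloc, parts.path.lstrip("/")
```

For the example URI above, the bucket is test-bucket and the object path is dir1/dir2/file.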

class airflow.providers.google.cloud.hooks.gcs.GCSAsyncHook(**kwargs)[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook

GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.

sync_hook_class[source]
async get_storage_client(session)[source]

Return a Google Cloud Storage service object.
