airflow.providers.google.cloud.hooks.gcs

This module contains a Google Cloud Storage hook.

Attributes

`RT`
`T`
`FParams`
`List`
`DEFAULT_TIMEOUT`
`PROVIDE_BUCKET`

Classes

`GCSHook`	Use the Google Cloud connection to interact with Google Cloud Storage.
`GCSAsyncHook`	GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.

Functions

`gcs_object_is_directory`(bucket)	Return True if the given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or an empty bucket.
`parse_json_from_gcs`(gcp_conn_id, file_uri[, ...])	Download and parse a JSON file from Google Cloud Storage.
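
The directory check described for `gcs_object_is_directory` can be sketched as follows, assuming a URL counts as a directory when its blob part is empty or ends with `/`. The helper name `looks_like_gcs_directory` is hypothetical and uses only the standard library; it is not the provider's implementation.

```python
from urllib.parse import urlsplit


def looks_like_gcs_directory(url: str) -> bool:
    """Hypothetical helper: True for gs://<bucket> or gs://<bucket>/<prefix>/."""
    parts = urlsplit(url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"Not a gs:// URL: {url!r}")
    # The blob name is the URL path without its leading slash.
    blob = parts.path.lstrip("/")
    # Assumption: an empty blob (bare bucket) or a trailing slash means "directory".
    return blob == "" or blob.endswith("/")
```

For example, `looks_like_gcs_directory("gs://my-bucket/data/")` is true under this assumption, while a URL ending in a file name is not.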

Module Contents

airflow.providers.google.cloud.hooks.gcs.RT

airflow.providers.google.cloud.hooks.gcs.T

airflow.providers.google.cloud.hooks.gcs.FParams

airflow.providers.google.cloud.hooks.gcs.List

airflow.providers.google.cloud.hooks.gcs.DEFAULT_TIMEOUT = 60

airflow.providers.google.cloud.hooks.gcs.PROVIDE_BUCKET: str = None

class airflow.providers.google.cloud.hooks.gcs.GCSHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Use the Google Cloud connection to interact with Google Cloud Storage.

get_conn()

Return a Google Cloud Storage service object.

copy(source_bucket, source_object, destination_bucket=None, destination_object=None)

Copy an object from one bucket to another, renaming it if requested.

You can omit destination_bucket or destination_object, in which case the source bucket or object name is used, but you cannot omit both.

Parameters:

source_bucket (str) – The bucket of the object to copy from.
source_object (str) – The object to copy.
destination_bucket (str | None) – The bucket to copy the object to. Can be omitted; then the same bucket is used.
destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
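
The defaulting rule for the destination arguments can be illustrated with a small standalone function. This is a sketch of the documented behaviour only, not the hook's source; `resolve_copy_target` is a hypothetical name, and raising `ValueError` when both arguments are omitted is an assumption.

```python
from typing import Optional, Tuple


def resolve_copy_target(
    source_bucket: str,
    source_object: str,
    destination_bucket: Optional[str] = None,
    destination_object: Optional[str] = None,
) -> Tuple[str, str]:
    """Hypothetical helper mirroring the documented defaults of copy()."""
    # Omitting both would copy the object onto itself, so reject it (assumed error type).
    if destination_bucket is None and destination_object is None:
        raise ValueError("Provide destination_bucket, destination_object, or both.")
    # Whichever argument is missing falls back to the source value.
    return (destination_bucket or source_bucket, destination_object or source_object)
```

With only `destination_object="y.txt"`, the object is renamed inside the same bucket; with only `destination_bucket="b"`, it keeps its name in the new bucket.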

rewrite(source_bucket, source_object, destination_bucket, destination_object=None)

Similar to copy; supports files over 5 TB, and copying between locations and/or storage classes.

destination_object can be omitted, in which case source_object is used.

Parameters:

source_bucket (str) – The bucket of the object to copy from.
source_object (str) – The object to copy.
destination_bucket (str) – The bucket to copy the object to.
destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.

download(bucket_name: str, object_name: str, filename: None = None, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) → bytes

download(bucket_name: str, object_name: str, filename: str, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) → str

Download a file from Google Cloud Storage.

When no filename is supplied, the method loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns that location. For files that exceed the available memory, writing to a file is recommended.

Parameters:

bucket_name – The bucket to fetch from.
object_name – The object to fetch.
filename – If set, a local file path where the file should be written to.
chunk_size – Blob chunk size.
timeout – Request timeout in seconds.
num_max_attempts – Number of attempts to download the file.
user_project – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
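
The two overloads can be exercised through a thin wrapper like the one below. It is a sketch: `fetch_object` is a hypothetical helper, `hook` is assumed to be a `GCSHook` instance (for example `GCSHook(gcp_conn_id="google_cloud_default")`), and the retry count is an illustrative value.

```python
from typing import Optional


def fetch_object(hook, bucket_name: str, object_name: str, filename: Optional[str] = None):
    """Return bytes when filename is None, otherwise the path that was written."""
    if filename is None:
        # In-memory variant: suitable only when the object fits in memory.
        return hook.download(bucket_name=bucket_name, object_name=object_name)
    # File variant: preferred for large objects.
    return hook.download(
        bucket_name=bucket_name,
        object_name=object_name,
        filename=filename,
        num_max_attempts=3,  # illustrative value, not a recommendation
    )
```

The wrapper only forwards documented parameters, so the return type follows the overload that matches the `filename` argument.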

download_as_byte_array(bucket_name, object_name, chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1)

Download a file from Google Cloud Storage.

This method takes no filename parameter: it loads the file into memory and returns its content as bytes. For files that exceed the available memory, write to a file with download instead.

Parameters:

bucket_name (str) – The bucket to fetch from.
object_name (str) – The object to fetch.
chunk_size (int | None) – Blob chunk size.
timeout (int | None) – Request timeout in seconds.
num_max_attempts (int | None) – Number of attempts to download the file.

provide_file(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, dir=None, user_project=None)

Download the file to a temporary directory and return a file handle.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters:

bucket_name (str) – The bucket to fetch from.
object_name (str | None) – The object to fetch.
object_url (str | None) – File reference URL. Must start with “gs://”.
dir (str | None) – The tmp sub directory to download the file to. (passed to NamedTemporaryFile)
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

Returns:

File handler

Return type:

collections.abc.Generator[IO[bytes], None, None]
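
A minimal usage sketch, assuming `provide_file` is used as a context manager (which its generator return type suggests) and that the yielded handle is a binary file object. `count_lines` is a hypothetical helper and `hook` is assumed to be a `GCSHook` instance.

```python
def count_lines(hook, bucket_name: str, object_name: str) -> int:
    """Hypothetical helper: count the lines of a GCS object via a temporary file."""
    with hook.provide_file(bucket_name=bucket_name, object_name=object_name) as handle:
        handle.seek(0)  # defensive: make sure reading starts at the beginning
        # Iterating a binary file handle yields one bytes object per line.
        return sum(1 for _ in handle)
```

The temporary file is expected to exist only inside the `with` block, so any processing of its content belongs there.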

provide_file_and_upload(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, user_project=None)

Create a temporary file, return a file handle, and upload the file's content on close.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters:

bucket_name (str) – The bucket to upload to.
object_name (str | None) – The name of the object to upload.
object_url (str | None) – File reference URL. Must start with “gs://”.
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

Returns:

File handler

Return type:

collections.abc.Generator[IO[bytes], None, None]
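
A minimal usage sketch, assuming `provide_file_and_upload` is used as a context manager and yields a writable binary handle. `write_lines` is a hypothetical helper and `hook` is assumed to be a `GCSHook` instance.

```python
def write_lines(hook, bucket_name: str, object_name: str, lines) -> None:
    """Hypothetical helper: write text lines to a temporary file that is uploaded on exit."""
    with hook.provide_file_and_upload(bucket_name=bucket_name, object_name=object_name) as handle:
        for line in lines:
            # The handle is assumed to be binary, so encode each line explicitly.
            handle.write(line.encode("utf-8") + b"\n")
    # Leaving the with block is the point at which the content is uploaded.
```

Because the upload happens on close, nothing is expected to appear in the bucket until the block exits.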

upload(bucket_name, object_name, filename=None, data=None, mime_type=None, gzip=False, encoding='utf-8', chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1, metadata=None, cache_control=None, user_project=None)

Upload a local file or file data as string or bytes to Google Cloud Storage.

Parameters:

bucket_name (str) – The bucket to upload to.
object_name (str) – The object name to set when uploading the file.
filename (str | None) – The local file path to the file to be uploaded.
data (str | bytes | None) – The file’s data as a string or bytes to be uploaded.
mime_type (str | None) – The file’s mime type set when uploading the file.
gzip (bool) – Option to compress local file or file data for upload
encoding (str) – bytes encoding for file data if provided as string
chunk_size (int | None) – Blob chunk size.
timeout (int | None) – Request timeout in seconds.
num_max_attempts (int) – Number of attempts to try to upload the file.
metadata (dict | None) – The metadata to be uploaded with the file.
cache_control (str | None) – Cache-Control metadata field.
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
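
As an illustration of the `data` path, the sketch below serializes a dictionary and uploads it as a string. `upload_json` is a hypothetical helper, `hook` is assumed to be a `GCSHook` instance, and passing `data` without `filename` is assumed to be the intended way to upload in-memory content.

```python
import json


def upload_json(hook, bucket_name: str, object_name: str, payload: dict) -> None:
    """Hypothetical helper: upload a dict as a JSON object."""
    hook.upload(
        bucket_name=bucket_name,
        object_name=object_name,
        # String data; the documented default encoding is utf-8.
        data=json.dumps(payload, sort_keys=True),
        mime_type="application/json",
        gzip=False,
    )
```

To upload an existing local file instead, pass `filename` rather than `data`.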

exists(bucket_name, object_name, retry=DEFAULT_RETRY)

Check for the existence of a file in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the blob to check in the Google Cloud Storage bucket.
retry (google.api_core.retry.Retry) – (Optional) How to retry the RPC.

get_blob_update_time(bucket_name, object_name)

Get the update time of a file in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the blob whose update time is retrieved from the Google Cloud Storage bucket.

is_updated_after(bucket_name, object_name, ts)

Check whether an object was updated after the given time in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
ts (datetime.datetime) – The timestamp to check against.

is_updated_between(bucket_name, object_name, min_ts, max_ts)

Check whether an object was updated between the given timestamps in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
min_ts (datetime.datetime) – The minimum timestamp to check against.
max_ts (datetime.datetime) – The maximum timestamp to check against.

is_updated_before(bucket_name, object_name, ts)

Check whether an object was updated before the given time in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
ts (datetime.datetime) – The timestamp to check against.

is_older_than(bucket_name, object_name, seconds)

Check whether an object is older than the given number of seconds.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
seconds (int) – The age in seconds to check against.
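
The age check can be pictured as a comparison between the object's update time and "now minus the given number of seconds". The function below is a standalone sketch with a hypothetical name; the strict comparison and the use of timezone-aware UTC datetimes are assumptions, not confirmed details of the hook.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def older_than(updated: datetime, seconds: int, now: Optional[datetime] = None) -> bool:
    """Hypothetical helper: True if `updated` lies more than `seconds` in the past."""
    # Allow injecting `now` so the check is deterministic in tests.
    now = now or datetime.now(timezone.utc)
    # Assumption: an object exactly `seconds` old does not yet count as older.
    return now - timedelta(seconds=seconds) > updated
```

In practice, `updated` would come from a call such as `get_blob_update_time(bucket_name, object_name)`.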

delete(bucket_name, object_name)

Delete an object from the bucket.

Parameters:

bucket_name (str) – name of the bucket, where the object resides
object_name (str) – name of the object to delete

delete_bucket(bucket_name, force=False, user_project=None)

Delete a bucket from Google Cloud Storage.

Parameters:

bucket_name (str) – name of the bucket which will be deleted
force (bool) – if False (the default), a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

list(bucket_name, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None, user_project=None)

List all objects from the bucket with the given single prefix or multiple prefixes.

Parameters:

bucket_name (str) – bucket name
versions (bool | None) – if true, list all versions of the objects
max_results (int | None) – max count of items to return in a single page of responses
prefix (str | List[str] | None) – string or list of strings that filters objects whose names begin with it/them
delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. '.csv')
match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json').
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.

Returns:

a stream of object names matching the filtering criteria
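
The effect of passing one prefix or several can be illustrated locally. The function below is a standalone sketch with a hypothetical name; it filters an in-memory list rather than calling the service, and the result ordering is an assumption.

```python
from typing import List, Optional, Union


def filter_by_prefix(names: List[str], prefix: Optional[Union[str, List[str]]]) -> List[str]:
    """Hypothetical helper: keep names that start with the prefix or any of the prefixes."""
    if prefix is None:
        return list(names)
    # Normalize a single string and a list of strings to one tuple of prefixes.
    prefixes = (prefix,) if isinstance(prefix, str) else tuple(prefix)
    # str.startswith accepts a tuple and matches if any element matches.
    return [name for name in names if name.startswith(prefixes)]
```

With the real hook, the equivalent request would be along the lines of `hook.list(bucket_name, prefix=["logs/", "data/"])`.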

list_by_timespan(bucket_name, timespan_start, timespan_end, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None)

List all objects from the bucket with the given string prefix that were updated in the time range.

Parameters:

bucket_name (str) – bucket name
timespan_start (datetime.datetime) – will return objects that were updated at or after this datetime (UTC)
timespan_end (datetime.datetime) – will return objects that were updated before this datetime (UTC)
versions (bool | None) – if true, list all versions of the objects
max_results (int | None) – max count of items to return in a single page of responses
prefix (str | None) – prefix string that filters objects whose names begin with this prefix
delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. '.csv')
match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g. '**/*/.json').

Returns:

a stream of object names matching the filtering criteria

Return type:

List[str]
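
The parameter descriptions imply a half-open time window: the start is inclusive and the end is exclusive. A standalone sketch of that membership test, with a hypothetical function name and timezone-aware UTC datetimes assumed:

```python
from datetime import datetime


def in_timespan(updated: datetime, timespan_start: datetime, timespan_end: datetime) -> bool:
    """Hypothetical helper: True if timespan_start <= updated < timespan_end."""
    # "at or after" the start, and strictly "before" the end.
    return timespan_start <= updated < timespan_end
```

A half-open window means consecutive, non-overlapping time ranges select each object exactly once.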

get_size(bucket_name, object_name)

Get the size of a file in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_crc32c(bucket_name, object_name)

Get the CRC32c checksum of an object in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_md5hash(bucket_name, object_name)

Get the MD5 hash of an object in Google Cloud Storage.

Parameters:

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
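
One use of the hash is to verify local data against the stored object. The sketch below assumes that `get_md5hash` returns the base64-encoded MD5 digest, which is how Google Cloud Storage stores the value; both helper names are hypothetical and `hook` is assumed to be a `GCSHook` instance.

```python
import base64
import hashlib


def local_md5_base64(data: bytes) -> str:
    """Hypothetical helper: base64-encoded MD5 digest of local bytes."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches_remote(hook, bucket_name: str, object_name: str, data: bytes) -> bool:
    """Hypothetical helper: compare local bytes with the hash reported for the object."""
    remote = hook.get_md5hash(bucket_name=bucket_name, object_name=object_name)
    return remote == local_md5_base64(data)
```

If the hook returned a hex digest instead, the comparison would need `hashlib.md5(data).hexdigest()`; the encoding here is an assumption to confirm against a real object.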

get_metadata(bucket_name, object_name)

Get the metadata of an object in Google Cloud Storage.

Parameters:

bucket_name (str) – Name of the Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object containing the desired metadata

Returns:

The metadata associated with the object

Return type:

dict | None

create_bucket(bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=PROVIDE_PROJECT_ID, labels=None)

Create a new bucket.

Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.