`airflow.providers.google.cloud.hooks.gcs`¶

This module contains a Google Cloud Storage hook.

Module Contents¶

airflow.providers.google.cloud.hooks.gcs.RT[source]¶

airflow.providers.google.cloud.hooks.gcs.T[source]¶

airflow.providers.google.cloud.hooks.gcs.DEFAULT_TIMEOUT = 60[source]¶

class airflow.providers.google.cloud.hooks.gcs.GCSHook(gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, google_cloud_storage_conn_id: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None)[source]¶

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Interact with Google Cloud Storage. This hook uses the Google Cloud connection.

get_conn(self)[source]¶

Returns a Google Cloud Storage service object.

copy(self, source_bucket: str, source_object: str, destination_bucket: Optional[str] = None, destination_object: Optional[str] = None)[source]¶

Copies an object from a bucket to another, with renaming if requested.

Either destination_bucket or destination_object can be omitted, in which case the source bucket or source object is used, but they cannot both be omitted.

Parameters

source_bucket (str) – The bucket of the object to copy from.
source_object (str) – The object to copy.
destination_bucket (str) – The bucket to copy the object to. Can be omitted; then the same bucket is used.
destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
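The defaulting rule for copy can be sketched as a small standalone function. This is only an illustration of the documented behaviour, not the hook's own code: `resolve_destination` is a hypothetical name, and raising `ValueError` when the destination ends up identical to the source is an assumption.

```python
from typing import Optional, Tuple


def resolve_destination(
    source_bucket: str,
    source_object: str,
    destination_bucket: Optional[str] = None,
    destination_object: Optional[str] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: fill in whichever destination part was omitted."""
    bucket = destination_bucket or source_bucket
    obj = destination_object or source_object
    # Assumption: copying an object onto itself is treated as an error.
    if (bucket, obj) == (source_bucket, source_object):
        raise ValueError("Provide a different destination bucket or object name.")
    return bucket, obj
```

Calling `resolve_destination("src", "a.txt", destination_bucket="dst")` keeps the object name and changes only the bucket.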

rewrite(self, source_bucket: str, source_object: str, destination_bucket: str, destination_object: Optional[str] = None)[source]¶

Has the same functionality as copy, except that it also works on files over 5 TB, as well as when copying between locations and/or storage classes.

destination_object can be omitted, in which case source_object is used.

Parameters

source_bucket (str) – The bucket of the object to copy from.
source_object (str) – The object to copy.
destination_bucket (str) – The bucket to copy the object to.
destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used.

download(self, bucket_name: str, object_name: str, filename: Optional[str] = None, chunk_size: Optional[int] = None, timeout: Optional[int] = DEFAULT_TIMEOUT, num_max_attempts: Optional[int] = 1)[source]¶

Downloads a file from Google Cloud Storage.

When no filename is supplied, this method loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns the location. For files larger than the available memory, writing to a file is recommended.

Parameters

bucket_name (str) – The bucket to fetch from.
object_name (str) – The object to fetch.
filename (str) – If set, a local file path where the file should be written to.
chunk_size (int) – Blob chunk size.
timeout (int) – Request timeout in seconds.
num_max_attempts (int) – Number of attempts to download the file.
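The two modes of download can be sketched with a pair of thin wrappers. The wrapper names and the `FakeHook` stand-in are hypothetical and exist only so the example runs without credentials; the sketch also assumes the in-memory form returns bytes.

```python
import os
import tempfile
from typing import Any


def read_object_text(hook: Any, bucket_name: str, object_name: str, encoding: str = "utf-8") -> str:
    """No filename: the content comes back in memory (assumed to be bytes)."""
    content = hook.download(bucket_name=bucket_name, object_name=object_name)
    return content.decode(encoding) if isinstance(content, bytes) else content


def save_object(hook: Any, bucket_name: str, object_name: str, local_path: str) -> str:
    """With filename: the content is written to disk and the path is returned."""
    return hook.download(bucket_name=bucket_name, object_name=object_name, filename=local_path)


class FakeHook:
    """Hypothetical stand-in for GCSHook that serves a fixed payload."""

    def download(self, bucket_name, object_name, filename=None, **kwargs):
        payload = b"hello"
        if filename is None:
            return payload
        with open(filename, "wb") as fh:
            fh.write(payload)
        return filename


if __name__ == "__main__":
    hook = FakeHook()
    print(read_object_text(hook, "my-bucket", "greeting.txt"))
    with tempfile.TemporaryDirectory() as tmp:
        print(save_object(hook, "my-bucket", "greeting.txt", os.path.join(tmp, "greeting.txt")))
```

With a configured connection, a real `GCSHook()` instance could be passed in place of `FakeHook()`.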

provide_file(self, bucket_name: Optional[str] = None, object_name: Optional[str] = None, object_url: Optional[str] = None)[source]¶

Downloads the file to a temporary directory and returns a file handle.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters

bucket_name (str) – The bucket to fetch from.
object_name (str) – The object to fetch.
object_url (str) – File reference URL. Must start with “gs://”.

Returns

File handler
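Passing object_url is equivalent to passing a bucket and an object name, so the URL has to be split into those two parts somewhere. A minimal sketch of that split, using only the standard library, might look like the following; `split_gcs_url` is a hypothetical helper, not the function the hook itself uses.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_gcs_url(object_url: str) -> Tuple[str, str]:
    """Hypothetical helper: turn gs://bucket/path into (bucket, path)."""
    parts = urlsplit(object_url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"Expected a gs://bucket/object URL, got {object_url!r}")
    # The path keeps its leading slash after urlsplit; object names do not have one.
    return parts.netloc, parts.path.lstrip("/")
```

For example, `split_gcs_url("gs://my-bucket/path/to/file.csv")` yields the bucket name and the object path as separate strings.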

provide_file_and_upload(self, bucket_name: Optional[str] = None, object_name: Optional[str] = None, object_url: Optional[str] = None)[source]¶

Creates a temporary file, returns a file handle, and uploads the file's content on close.

You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.

Parameters

bucket_name (str) – The bucket to upload to.
object_name (str) – The object name to set when uploading the file.
object_url (str) – File reference URL. Must start with “gs://”.

Returns

File handler
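The "write to a temporary file, upload on close" pattern can be illustrated with a small context manager. This is a sketch of the pattern only, not the hook's implementation: `temp_file_then_upload` and its `upload` callback are hypothetical, and reopening the temporary file by name is assumed to work on the platform (it does on POSIX systems).

```python
import tempfile
from contextlib import contextmanager
from typing import IO, Callable, Iterator


@contextmanager
def temp_file_then_upload(upload: Callable[[str], None], suffix: str = "") -> Iterator[IO[bytes]]:
    """Hypothetical sketch: hand out a temp file, then pass its path to `upload`."""
    with tempfile.NamedTemporaryFile(suffix=suffix) as handle:
        yield handle
        # Make sure everything the caller wrote is on disk before uploading.
        handle.flush()
        upload(handle.name)
```

With a real hook, the callback could be something like `lambda path: hook.upload(bucket_name="my-bucket", object_name="report.csv", filename=path)`, where the bucket and object names are placeholders.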

upload(self, bucket_name: str, object_name: str, filename: Optional[str] = None, data: Optional[Union[str, bytes]] = None, mime_type: Optional[str] = None, gzip: bool = False, encoding: str = 'utf-8', chunk_size: Optional[int] = None, timeout: Optional[int] = DEFAULT_TIMEOUT, num_max_attempts: int = 1)[source]¶

Uploads a local file or file data as string or bytes to Google Cloud Storage.

Parameters

bucket_name (str) – The bucket to upload to.
object_name (str) – The object name to set when uploading the file.
filename (str) – The local file path to the file to be uploaded.
data (str or bytes) – The file’s data as a string or bytes to be uploaded.
mime_type (str) – The file’s MIME type set when uploading the file.
gzip (bool) – Option to compress the local file or file data for upload.
encoding (str) – Bytes encoding for file data if provided as a string.
chunk_size (int) – Blob chunk size.
timeout (int) – Request timeout in seconds.
num_max_attempts (int) – Number of attempts to try to upload the file.
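How the data, encoding, and gzip parameters relate to each other can be sketched as follows. `prepare_payload` is a hypothetical helper that mirrors the parameter descriptions; it is not taken from the hook's source.

```python
import gzip as gz
from typing import Union


def prepare_payload(data: Union[str, bytes], encoding: str = "utf-8", gzip: bool = False) -> bytes:
    """Hypothetical helper: encode string data, then optionally compress it."""
    # Strings are turned into bytes using the requested encoding.
    raw = data.encode(encoding) if isinstance(data, str) else data
    # Compression is applied to the bytes, whatever their origin.
    return gz.compress(raw) if gzip else raw
```

The result is always bytes, so a string and its encoded form produce the same payload.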

exists(self, bucket_name: str, object_name: str)[source]¶

Checks for the existence of a file in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
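A common use of exists is to guard a download. The sketch below assumes exists returns a boolean; `download_if_present` and `FakeHook` are hypothetical names introduced so the example runs without a Google Cloud connection.

```python
from typing import Any, Optional


def download_if_present(hook: Any, bucket_name: str, object_name: str) -> Optional[bytes]:
    """Return the object's content, or None when the object does not exist."""
    if not hook.exists(bucket_name=bucket_name, object_name=object_name):
        return None
    return hook.download(bucket_name=bucket_name, object_name=object_name)


class FakeHook:
    """Hypothetical in-memory stand-in for GCSHook."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def exists(self, bucket_name, object_name):
        return (bucket_name, object_name) in self._objects

    def download(self, bucket_name, object_name, filename=None):
        return self._objects[(bucket_name, object_name)]
```

Note that the check and the download are two separate requests, so the object could still disappear between them.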

get_blob_update_time(self, bucket_name: str, object_name: str)[source]¶

Gets the update time of a file in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object whose update time is retrieved from the Google Cloud Storage bucket.

is_updated_after(self, bucket_name: str, object_name: str, ts: datetime)[source]¶

Checks if an object was updated after the given time in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
ts (datetime.datetime) – The timestamp to check against.

is_updated_between(self, bucket_name: str, object_name: str, min_ts: datetime, max_ts: datetime)[source]¶

Checks if an object was updated between the given times in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
min_ts (datetime.datetime) – The minimum timestamp to check against.
max_ts (datetime.datetime) – The maximum timestamp to check against.
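The comparison behind a "between" check can be sketched on plain datetimes. Two details here are assumptions rather than documented behaviour: naive timestamps are treated as UTC, and the interval is half-open (the minimum is included, the maximum is not). The function names are hypothetical.

```python
from datetime import datetime, timezone


def as_utc(ts: datetime) -> datetime:
    """Assumption: a naive timestamp is interpreted as UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def updated_between(updated: datetime, min_ts: datetime, max_ts: datetime) -> bool:
    """Assumption: min_ts is inclusive and max_ts is exclusive."""
    return as_utc(min_ts) <= as_utc(updated) < as_utc(max_ts)
```

Normalising every value first avoids the `TypeError` Python raises when naive and timezone-aware datetimes are compared directly.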

is_updated_before(self, bucket_name: str, object_name: str, ts: datetime)[source]¶

Checks if an object was updated before the given time in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
ts (datetime.datetime) – The timestamp to check against.

is_older_than(self, bucket_name: str, object_name: str, seconds: int)[source]¶

Checks if the object is older than the given number of seconds.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
seconds (int) – The age threshold in seconds to check against.
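An age check of this kind reduces to comparing the update time with "now minus N seconds". The sketch below is a hypothetical standalone version, assuming the update time is timezone-aware; the optional `now` argument exists only to make the function easy to test.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def older_than(updated: datetime, seconds: int, now: Optional[datetime] = None) -> bool:
    """Hypothetical helper: True if `updated` lies more than `seconds` in the past."""
    current = now or datetime.now(timezone.utc)
    return updated < current - timedelta(seconds=seconds)
```

With a real hook, `updated` could come from get_blob_update_time for the same bucket and object.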

delete(self, bucket_name: str, object_name: str)[source]¶

Deletes an object from the bucket.

Parameters

bucket_name (str) – name of the bucket, where the object resides
object_name (str) – name of the object to delete

delete_bucket(self, bucket_name: str, force: bool = False)[source]¶

Deletes a bucket from Google Cloud Storage.

Parameters

bucket_name (str) – name of the bucket which will be deleted
force (bool) – If False, a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket.

list(self, bucket_name, versions=None, max_results=None, prefix=None, delimiter=None)[source]¶

Lists all objects in the bucket whose names start with the given prefix.

Parameters

bucket_name (str) – bucket name
versions (bool) – if true, list all versions of the objects
max_results (int) – max count of items to return in a single page of responses
prefix (str) – prefix string which filters objects whose names begin with this prefix
delimiter (str) – filters objects based on the delimiter (e.g. ‘.csv’)

Returns

a stream of object names matching the filtering criteria

list_by_timespan(self, bucket_name: str, timespan_start: datetime, timespan_end: datetime, versions: bool = None, max_results: int = None, prefix: str = None, delimiter: str = None)[source]¶

Lists all objects in the bucket whose names start with the given prefix and that were updated between timespan_start and timespan_end.

Parameters

bucket_name (str) – bucket name
timespan_start (datetime) – will return objects that were updated at or after this datetime (UTC)
timespan_end (datetime) – will return objects that were updated before this datetime (UTC)
versions (bool) – if true, list all versions of the objects
max_results (int) – max count of items to return in a single page of responses
prefix (str) – prefix string which filters objects whose names begin with this prefix
delimiter (str) – filters objects based on the delimiter (e.g. ‘.csv’)

Returns

a stream of object names matching the filtering criteria
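The timespan filter described by timespan_start ("at or after") and timespan_end ("before") is a half-open interval. A minimal sketch of that filter over (name, update time) pairs is shown below; `names_in_timespan` is a hypothetical helper operating on local data, not a call into Cloud Storage.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def names_in_timespan(
    objects: Iterable[Tuple[str, datetime]],
    timespan_start: datetime,
    timespan_end: datetime,
) -> List[str]:
    """Keep names whose update time falls in [timespan_start, timespan_end)."""
    return [
        name
        for name, updated in objects
        if timespan_start <= updated < timespan_end
    ]
```

Because the end is exclusive, consecutive windows such as one per day can be queried without returning the same object twice.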

get_size(self, bucket_name: str, object_name: str)[source]¶

Gets the size of a file in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_crc32c(self, bucket_name: str, object_name: str)[source]¶

Gets the CRC32c checksum of an object in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.

get_md5hash(self, bucket_name: str, object_name: str)[source]¶

Gets the MD5 hash of an object in Google Cloud Storage.

Parameters

bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
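One use for the MD5 hash is verifying a local copy against the stored object. The sketch below assumes the value returned by get_md5hash is the base64-encoded MD5 digest, as Cloud Storage reports it in object metadata; both function names are hypothetical.

```python
import base64
import hashlib


def local_md5_base64(data: bytes) -> str:
    """Compute the MD5 digest of `data` and encode it as base64 text."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches_remote_md5(data: bytes, remote_md5: str) -> bool:
    """Compare local content with a hash string such as one from get_md5hash."""
    return local_md5_base64(data) == remote_md5
```

A hex digest from `hashlib.md5(...).hexdigest()` would not match under this assumption, which is why the raw digest is base64-encoded first.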

create_bucket(self, bucket_name: str, resource: Optional[dict] = None, storage_class: str = 'MULTI_REGIONAL', location: str = 'US', project_id: Optional[str] = None, labels: Optional[dict] = None)[source]¶

Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
