airflow.providers.google.cloud.hooks.gcs
This module contains a Google Cloud Storage hook.
Module Contents
Classes
GCSHook | Use the Google Cloud connection to interact with Google Cloud Storage.
GCSAsyncHook | GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.
Functions
gcs_object_is_directory | Return True if given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or empty bucket.
parse_json_from_gcs | Downloads and parses a JSON file from Google Cloud Storage.
Attributes
- class airflow.providers.google.cloud.hooks.gcs.GCSHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseHook
Use the Google Cloud connection to interact with Google Cloud Storage.
- copy(source_bucket, source_object, destination_bucket=None, destination_object=None)
Copies an object from one bucket to another, with renaming if requested.
Either destination_bucket or destination_object can be omitted, in which case the source bucket or object is used, but not both.
- Parameters
source_bucket (str) – The bucket of the object to copy from.
source_object (str) – The object to copy.
destination_bucket (str | None) – The bucket to copy the object to. Can be omitted; then the same bucket is used.
destination_object (str | None) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
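As a hedged illustration of the two omission rules above, the sketch below wraps copy in two small helpers. The helper names and the duck-typed hook argument are assumptions made for this example; any object exposing copy with the signature above would do.

```python
def copy_under_new_name(hook, bucket, source_object, new_name):
    """Copy an object to a new name inside the same bucket."""
    # destination_bucket is omitted, so the source bucket is reused.
    hook.copy(
        source_bucket=bucket,
        source_object=source_object,
        destination_object=new_name,
    )


def copy_to_other_bucket(hook, source_bucket, source_object, destination_bucket):
    """Copy an object to another bucket, keeping its name."""
    # destination_object is omitted, so the source object name is reused.
    hook.copy(
        source_bucket=source_bucket,
        source_object=source_object,
        destination_bucket=destination_bucket,
    )


# Usage with a real hook (requires the Google provider and a configured
# connection; bucket and object names here are hypothetical):
# from airflow.providers.google.cloud.hooks.gcs import GCSHook
# copy_under_new_name(GCSHook(), "my-bucket", "raw/a.csv", "archive/a.csv")
```

Omitting both destination arguments is the one combination the description rules out, so neither helper allows it.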
- rewrite(source_bucket, source_object, destination_bucket, destination_object=None)
Similar to copy; supports files over 5 TB, and copying between locations and/or storage classes.
destination_object can be omitted, in which case source_object is used.
- Parameters
- download(bucket_name: str, object_name: str, filename: None = None, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) bytes
- download(bucket_name: str, object_name: str, filename: str, chunk_size: int | None = None, timeout: int | None = DEFAULT_TIMEOUT, num_max_attempts: int | None = 1, user_project: str | None = None) str
Downloads a file from Google Cloud Storage.
When no filename is supplied, this method loads the file into memory and returns its content. When a filename is supplied, it writes the file to the specified location and returns the location. For files larger than the available memory, writing to a file is recommended.
- Parameters
bucket_name – The bucket to fetch from.
object_name – The object to fetch.
filename – If set, a local file path where the file should be written to.
chunk_size – Blob chunk size.
timeout – Request timeout in seconds.
num_max_attempts – Number of attempts to download the file.
user_project – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
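The two overloads above can be sketched as a pair of helpers: one returns bytes, the other writes to disk and returns the path. The helper names and the default of three attempts are assumptions for illustration only.

```python
def read_into_memory(hook, bucket_name, object_name):
    """Return the object's content as bytes (no filename is given)."""
    return hook.download(bucket_name=bucket_name, object_name=object_name)


def save_to_disk(hook, bucket_name, object_name, local_path, attempts=3):
    """Write the object to local_path and return that path."""
    return hook.download(
        bucket_name=bucket_name,
        object_name=object_name,
        filename=local_path,
        num_max_attempts=attempts,  # retry count chosen for this sketch
    )
```

The second form is the one the description recommends for objects that may not fit in memory.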
- download_as_byte_array(bucket_name, object_name, chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1)
Downloads a file from Google Cloud Storage.
This method takes no filename: the file is loaded into memory and its content is returned. For files larger than the available memory, use download with a filename instead.
- provide_file(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, dir=None, user_project=None)
Downloads the file to a temporary directory and returns a file handle.
You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.
- Parameters
bucket_name (str) – The bucket to fetch from.
object_name (str | None) – The object to fetch.
object_url (str | None) – File reference URL. Must start with “gs://”.
dir (str | None) – The tmp sub directory to download the file to. (passed to NamedTemporaryFile)
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
- Returns
File handler
- Return type
Generator[IO[bytes], None, None]
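A minimal sketch of reading through the temporary file handle, assuming provide_file is used as a context manager (as its Generator return type suggests). The helper name count_lines is hypothetical.

```python
def count_lines(hook, object_url):
    """Count the lines of the object referenced by a gs:// URL."""
    with hook.provide_file(object_url=object_url) as handle:
        handle.seek(0)  # make sure we read from the start of the file
        return sum(1 for _ in handle)
```

Passing only object_url is the second of the two calling styles described above; bucket_name plus object_name would work the same way.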
- provide_file_and_upload(bucket_name=PROVIDE_BUCKET, object_name=None, object_url=None, user_project=None)
Creates a temporary file, returns a file handle, and uploads the file's content on close.
You can use this method by passing the bucket_name and object_name parameters or just object_url parameter.
- Parameters
- Returns
File handler
- Return type
Generator[IO[bytes], None, None]
- upload(bucket_name, object_name, filename=None, data=None, mime_type=None, gzip=False, encoding='utf-8', chunk_size=None, timeout=DEFAULT_TIMEOUT, num_max_attempts=1, metadata=None, cache_control=None, user_project=None)
Uploads a local file or file data as string or bytes to Google Cloud Storage.
- Parameters
bucket_name (str) – The bucket to upload to.
object_name (str) – The object name to set when uploading the file.
filename (str | None) – The local file path to the file to be uploaded.
data (str | bytes | None) – The file’s data as a string or bytes to be uploaded.
mime_type (str | None) – The file’s mime type set when uploading the file.
gzip (bool) – Option to compress local file or file data for upload
encoding (str) – bytes encoding for file data if provided as string
chunk_size (int | None) – Blob chunk size.
timeout (int | None) – Request timeout in seconds.
num_max_attempts (int) – Number of attempts to try to upload the file.
metadata (dict | None) – The metadata to be uploaded with the file.
cache_control (str | None) – Cache-Control metadata field.
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
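For example, uploading in-memory data rather than a local file might look like the sketch below. The helper name upload_json and its compress flag are assumptions introduced for this example; only the keyword arguments come from the signature above.

```python
import json


def upload_json(hook, bucket_name, object_name, payload, compress=False):
    """Serialize payload to JSON and upload it as a string via data=."""
    hook.upload(
        bucket_name=bucket_name,
        object_name=object_name,
        data=json.dumps(payload),      # string data; no local file involved
        mime_type="application/json",
        gzip=compress,                 # optionally compress before upload
    )
```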
- exists(bucket_name, object_name, retry=DEFAULT_RETRY)
Checks for the existence of a file in Google Cloud Storage.
- Parameters
bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
retry (google.api_core.retry.Retry) – (Optional) How to retry the RPC
- get_blob_update_time(bucket_name, object_name)
Get the update time of a file in Google Cloud Storage.
- is_updated_after(bucket_name, object_name, ts)
Checks if an object was updated after the given time in Google Cloud Storage.
- Parameters
bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
ts (datetime.datetime) – The timestamp to check against.
- is_updated_between(bucket_name, object_name, min_ts, max_ts)
Checks if an object was updated between the given timestamps in Google Cloud Storage.
- Parameters
bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
min_ts (datetime.datetime) – The minimum timestamp to check against.
max_ts (datetime.datetime) – The maximum timestamp to check against.
- is_updated_before(bucket_name, object_name, ts)
Checks if an object was updated before the given time in Google Cloud Storage.
- Parameters
bucket_name (str) – The Google Cloud Storage bucket where the object is.
object_name (str) – The name of the object to check in the Google Cloud Storage bucket.
ts (datetime.datetime) – The timestamp to check against.
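A freshness check built on is_updated_after could be sketched as follows. The helper name updated_within is hypothetical, and using a timezone-aware UTC cutoff is a choice made here to avoid ambiguity, not a documented requirement.

```python
from datetime import datetime, timedelta, timezone


def updated_within(hook, bucket_name, object_name, max_age):
    """Return True if the object was updated less than max_age ago."""
    cutoff = datetime.now(timezone.utc) - max_age  # timezone-aware UTC
    return hook.is_updated_after(
        bucket_name=bucket_name,
        object_name=object_name,
        ts=cutoff,
    )


# Hypothetical usage: updated_within(hook, "my-bucket", "feed.csv", timedelta(hours=1))
```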
- delete_bucket(bucket_name, force=False, user_project=None)
Deletes a bucket from Google Cloud Storage.
- Parameters
bucket_name (str) – Name of the bucket to delete.
force (bool) – If False (the default), a non-empty bucket cannot be deleted; set force=True to allow deleting a non-empty bucket.
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
- list(bucket_name, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None, user_project=None)
Lists all objects from the bucket that match a single prefix or multiple prefixes.
- Parameters
bucket_name (str) – bucket name
versions (bool | None) – if true, list all versions of the objects
max_results (int | None) – max count of items to return in a single page of responses
prefix (str | List[str] | None) – string or list of strings; only objects whose names begin with it (or with any of them) are returned
delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’)
match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g., '**/*/.json').
user_project (str | None) – The identifier of the Google Cloud project to bill for the request. Required for Requester Pays buckets.
- Returns
a stream of object names matching the filtering criteria
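As a sketch of the prefix and match_glob filters, the helper below passes several prefixes at once and adds a glob only when one is given. The name list_reports and its parameters are hypothetical.

```python
def list_reports(hook, bucket_name, prefixes, glob=None):
    """List object names under any of the given prefixes."""
    kwargs = {"bucket_name": bucket_name, "prefix": list(prefixes)}
    if glob is not None:
        # Only pass match_glob when requested; it replaces the
        # deprecated delimiter-based filtering.
        kwargs["match_glob"] = glob
    return list(hook.list(**kwargs))
```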
- list_by_timespan(bucket_name, timespan_start, timespan_end, versions=None, max_results=None, prefix=None, delimiter=None, match_glob=None)
List all objects from the bucket with the given string prefix that were updated in the time range.
- Parameters
bucket_name (str) – bucket name
timespan_start (datetime.datetime) – will return objects that were updated at or after this datetime (UTC)
timespan_end (datetime.datetime) – will return objects that were updated before this datetime (UTC)
versions (bool | None) – if true, list all versions of the objects
max_results (int | None) – max count of items to return in a single page of responses
prefix (str | None) – prefix string; only objects whose names begin with this prefix are returned
delimiter (str | None) – (Deprecated) filters objects based on the delimiter (e.g. ‘.csv’)
match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string (e.g., '**/*/.json').
- Returns
a stream of object names matching the filtering criteria
- Return type
List[str]
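Because timespan_start is inclusive and timespan_end is exclusive, a whole UTC day maps cleanly onto one call. The sketch below assumes a hypothetical helper list_updated_on that takes a datetime.date.

```python
from datetime import date, datetime, timedelta, timezone


def list_updated_on(hook, bucket_name, day: date, prefix=None):
    """List objects updated during the given UTC calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)  # exclusive upper bound: next midnight
    return hook.list_by_timespan(
        bucket_name=bucket_name,
        timespan_start=start,
        timespan_end=end,
        prefix=prefix,
    )
```

Consecutive days produce half-open windows that neither overlap nor leave gaps.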
- get_crc32c(bucket_name, object_name)
Gets the CRC32c checksum of an object in Google Cloud Storage.
- get_md5hash(bucket_name, object_name)
Gets the MD5 hash of an object in Google Cloud Storage.
- create_bucket(bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)
Creates a new bucket.
Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
- Parameters
bucket_name (str) – The name of the bucket.
resource (dict | None) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert
storage_class (str) – This defines how objects in the bucket are stored and determines the SLA and the cost of storage. Values include MULTI_REGIONAL, REGIONAL, STANDARD, NEARLINE, and COLDLINE. If no storage class is specified when a bucket is created, Cloud Storage defaults to STANDARD; note that this method's signature passes 'MULTI_REGIONAL' by default.
location (str) – The location of the bucket. Object data for objects in the bucket resides in physical storage within this region. Defaults to US.
project_id (str | None) – The ID of the Google Cloud Project.
labels (dict | None) – User-provided labels, in key/value pairs.
- Returns
If successful, it returns the id of the bucket.
- Return type
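A bucket-creation call that sets the storage class explicitly, rather than relying on a default, might be sketched like this. The helper name and the managed-by label are illustrative assumptions.

```python
def create_labelled_bucket(hook, bucket_name, location="US", project_id=None):
    """Create a STANDARD-class bucket and return its id."""
    return hook.create_bucket(
        bucket_name=bucket_name,
        storage_class="STANDARD",         # explicit, not the signature default
        location=location,
        project_id=project_id,
        labels={"managed-by": "airflow"},  # hypothetical label
    )
```

Since bucket names share one flat namespace, the call fails if the name is already taken; callers may want to handle that case.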
- insert_bucket_acl(bucket_name, entity, role, user_project=None)
Creates a new ACL entry on the specified bucket.
See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert
- Parameters
bucket_name (str) – Name of the bucket.
entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes
role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”.
user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
- insert_object_acl(bucket_name, object_name, entity, role, generation=None, user_project=None)
Creates a new ACL entry on the specified object.
See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert
- Parameters
bucket_name (str) – Name of the bucket.
object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding
entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes
role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”.
generation (int | None) – Optional. If present, selects a specific revision of this object.
user_project (str | None) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
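Using the entity forms and roles listed above, granting read access on a single object could be sketched as follows. The helper name make_object_public is hypothetical.

```python
def make_object_public(hook, bucket_name, object_name):
    """Grant READER on one object to everyone via the allUsers entity."""
    hook.insert_object_acl(
        bucket_name=bucket_name,
        object_name=object_name,
        entity="allUsers",  # one of the documented entity forms
        role="READER",      # object ACLs accept OWNER or READER
    )
```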
- compose(bucket_name, source_objects, destination_object)
Composes a list of existing objects into a new object in the same storage bucket.
Currently, at most 32 objects can be concatenated in a single operation.
See: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
- Parameters
bucket_name (str) – The name of the bucket containing the source objects. This is also the same bucket to store the composed destination object.
source_objects (List[str]) – The list of source objects that will be composed into a single object.
destination_object (str) – The path of the composed destination object.
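One way to work around the 32-object limit is to compose in batches, feeding the partial result back in as the first source of the next batch. This is a sketch, not part of the hook; it assumes Cloud Storage accepts the destination object as one of its own sources (the usual append pattern). The helper name compose_many is hypothetical.

```python
MAX_COMPOSE_SOURCES = 32  # per-operation limit stated above


def compose_many(hook, bucket_name, source_objects, destination_object):
    """Concatenate any number of objects using repeated compose calls."""
    remaining = list(source_objects)
    if not remaining:
        raise ValueError("source_objects must not be empty")

    # First batch: up to 32 plain sources.
    hook.compose(
        bucket_name=bucket_name,
        source_objects=remaining[:MAX_COMPOSE_SOURCES],
        destination_object=destination_object,
    )
    remaining = remaining[MAX_COMPOSE_SOURCES:]

    # Later batches: the partial result plus up to 31 new sources.
    step = MAX_COMPOSE_SOURCES - 1
    while remaining:
        hook.compose(
            bucket_name=bucket_name,
            source_objects=[destination_object] + remaining[:step],
            destination_object=destination_object,
        )
        remaining = remaining[step:]
```

Order is preserved because the partial result always comes first in each later batch.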
- sync(source_bucket, destination_bucket, source_object=None, destination_object=None, recursive=True, allow_overwrite=False, delete_extra_files=False)
Synchronizes the contents of the buckets.
Parameters source_object and destination_object describe the root sync directories. If they are not passed, the entire bucket will be synchronized. If they are passed, they should point to directories.
Note
The synchronization of individual files is not supported. Only entire directories can be synchronized.
- Parameters
source_bucket (str) – The name of the bucket containing the source objects.
destination_bucket (str) – The name of the bucket containing the destination objects.
source_object (str | None) – The root sync directory in the source bucket.
destination_object (str | None) – The root sync directory in the destination bucket.
recursive (bool) – If True, subdirectories will be considered
allow_overwrite (bool) – if True, the files will be overwritten if a mismatched file is found. By default, overwriting files is not allowed
delete_extra_files (bool) – If True, deletes files in the destination that are not found in the source. By default, extra files are not deleted.
Note
This option can delete data quickly if you specify the wrong source/destination combination.
- Returns
none
- Return type
None
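A cautious wrapper around sync might keep deletion off unless the caller opts in explicitly. The helper name mirror_directory and its prune flag are assumptions made for this sketch.

```python
def mirror_directory(hook, source_bucket, destination_bucket, directory, prune=False):
    """Sync one directory between buckets, overwriting mismatched files."""
    hook.sync(
        source_bucket=source_bucket,
        destination_bucket=destination_bucket,
        source_object=directory,       # root sync directory in the source
        destination_object=directory,  # same root in the destination
        recursive=True,
        allow_overwrite=True,
        delete_extra_files=prune,      # off unless explicitly requested
    )
```

Keeping prune as a separate, default-off argument makes it harder to delete data by swapping source and destination by mistake.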
- airflow.providers.google.cloud.hooks.gcs.gcs_object_is_directory(bucket)
Return True if given Google Cloud Storage URL (gs://<bucket>/<blob>) is a directory or empty bucket.
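The rule in this description can be sketched in plain Python. This is an illustration of the stated behavior under the assumption that "directory" means a blob path ending in a slash; it is not the provider's implementation, and the name looks_like_directory is hypothetical.

```python
from urllib.parse import urlsplit


def looks_like_directory(gcs_url):
    """Return True for gs://<bucket> with an empty or slash-terminated blob."""
    parts = urlsplit(gcs_url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"Not a gs:// URL: {gcs_url!r}")
    blob = parts.path.lstrip("/")
    # An empty blob means the bare bucket; a trailing slash marks a directory.
    return blob == "" or blob.endswith("/")
```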
- airflow.providers.google.cloud.hooks.gcs.parse_json_from_gcs(gcp_conn_id, file_uri)
Downloads and parses a JSON file from Google Cloud Storage.
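A comparable download-and-parse step can be built from the hook methods documented above. The sketch below is an assumption about one reasonable approach, not the body of parse_json_from_gcs; the helper name load_json_object is hypothetical.

```python
import json


def load_json_object(hook, file_uri):
    """Download the object at a gs:// URI and parse it as JSON."""
    with hook.provide_file(object_url=file_uri) as handle:
        handle.seek(0)            # read from the start of the temp file
        return json.load(handle)  # accepts a binary file handle
```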
- class airflow.providers.google.cloud.hooks.gcs.GCSAsyncHook(**kwargs)
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook
GCSAsyncHook runs on the trigger worker and inherits from GoogleBaseAsyncHook.