airflow.providers.microsoft.azure.hooks.data_lake¶
This module contains integration with Azure Data Lake.
Module Contents¶
Classes¶
AzureDataLakeHook | Interacts with Azure Data Lake.
AzureDataLakeStorageV2Hook | This Hook interacts with an ADLS Gen2 storage account and mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace.
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook(azure_data_lake_conn_id=default_conn_name)[source]¶
Bases:
airflow.hooks.base.BaseHook
Interacts with Azure Data Lake.
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (= Client ID), a password (= Client Secret) and the extra fields tenant (Tenant) and account_name (Account Name); see the azure_data_lake_default connection for an example.
Client ID and client secret should be in the user and password parameters. Tenant and account name should be set in the extra field as {"tenant": "<TENANT>", "account_name": "ACCOUNT_NAME"}.
- Parameters
azure_data_lake_conn_id (str) – Reference to the Azure Data Lake connection.
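A minimal usage sketch (illustrative only): instantiate the hook from an existing azure_data_lake connection. The connection id below is the default name mentioned above; substitute your own.

from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

# Assumes an azure_data_lake connection with this id has been configured in Airflow.
adl_hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")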
- upload_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]¶
Upload a file to Azure Data Lake (a usage sketch follows the parameter list).
- Parameters
local_path (str) – local path. Can be single file, directory (in which case, upload recursively) or glob pattern. Recursive glob patterns using ** are not supported.
remote_path (str) – Remote path to upload to; if multiple files, this is the directory root to write within.
nthreads (int) – Number of threads to use. If None, uses the number of cores.
overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the upload quits without writing anything, regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
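A hedged example of uploading a single local file. The paths are placeholders, and adl_hook refers to the instance created in the sketch above.

# Upload one local file to a remote ADLS path (paths are illustrative).
adl_hook.upload_file(
    local_path="/tmp/report.csv",
    remote_path="landing/report.csv",
    overwrite=True,
)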
- download_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]¶
Download a file from Azure Data Lake Storage (a usage sketch follows the parameter list).
- Parameters
local_path (str) – local path. If downloading a single file, will write to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Will create directories as required.
remote_path (str) – remote path/globstring to use to find remote files. Recursive glob patterns using ** are not supported.
nthreads (int) – Number of threads to use. If None, uses the number of cores.
overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the download quits without writing anything, regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
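A hedged example of downloading into a local directory. The paths are placeholders, and adl_hook is the instance from the earlier sketch.

# Download a remote file (or glob matches) into a local directory (paths are illustrative).
adl_hook.download_file(
    local_path="/tmp/downloads",
    remote_path="landing/report.csv",
    overwrite=True,
)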
- list(path)[source]¶
List files in Azure Data Lake Storage.
- Parameters
path (str) – full path/globstring to use to list files in ADLS
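For example (the glob pattern is a placeholder; adl_hook is the instance from the earlier sketch):

# List remote paths matching a glob pattern.
csv_files = adl_hook.list(path="landing/*.csv")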
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeStorageV2Hook(adls_conn_id, public_read=False)[source]¶
Bases:
airflow.hooks.base.BaseHook
This hook interacts with an ADLS Gen2 storage account; it mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace. It uses the adls connection details to create a DataLakeServiceClient object.
Because WASB is marked as legacy and ADLS Gen1 is being retired, this ADLS Gen2 hook is provided for interacting with the storage account.
See also
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python
- Parameters
adls_conn_id (str) – Reference to the adls connection.
public_read (bool) – Whether anonymous public read access should be used. Defaults to False.
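A minimal sketch of creating the Gen2 hook. The connection id below is an assumed example; use the adls connection you have configured.

from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

# Assumes an adls connection with this id exists in Airflow.
adls2_hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")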
- create_file_system(file_system_name)[source]¶
Create a new file system under the specified account. A container acts as a file system for your files.
If the file system with the same name already exists, a ResourceExistsError will be raised. This method returns a client with which to interact with the newly created file system.
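For illustration (the file system name is a placeholder; adls2_hook is the instance from the sketch above):

# Create a new file system (container); raises ResourceExistsError if it already exists.
adls2_hook.create_file_system(file_system_name="raw-data")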
- get_file_system(file_system)[source]¶
Get a client to interact with the specified file system.
- Parameters
file_system (FileSystemProperties | str) – This can either be the name of the file system or an instance of FileSystemProperties.
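For example (the name is a placeholder):

# Get a client for an existing file system by name.
fs_client = adls2_hook.get_file_system(file_system="raw-data")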
- create_directory(file_system_name, directory_name, **kwargs)[source]¶
Create a directory under the specified file system.
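For example (the names are placeholders):

# Create a directory inside an existing file system.
adls2_hook.create_directory(file_system_name="raw-data", directory_name="daily")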
- get_directory_client(file_system_name, directory_name)[source]¶
Get the specific directory under the specified file system.
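For example (the names are placeholders):

# Get a client for an existing directory.
dir_client = adls2_hook.get_directory_client(file_system_name="raw-data", directory_name="daily")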
- upload_file(file_system_name, file_name, file_path, overwrite=False, **kwargs)[source]¶
Create a file with data in the file system.
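A hedged example (names and path are placeholders):

# Upload a local file into the file system.
adls2_hook.upload_file(
    file_system_name="raw-data",
    file_name="report.csv",
    file_path="/tmp/report.csv",
    overwrite=True,
)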
- upload_file_to_directory(file_system_name, directory_name, file_name, file_path, overwrite=False, **kwargs)[source]¶
Create a new file in the specified directory, upload data to it, and return the file client for further interaction (a usage sketch follows the parameter list).
- Parameters
file_system_name (str) – Name of the file system or instance of FileSystemProperties.
directory_name (str) – Name of the directory.
file_name (str) – Name of the file to be created.
file_path (str) – Path to the file to load.
overwrite (bool) – Boolean flag to overwrite an existing file or not.
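A hedged example (names and path are placeholders):

# Upload a local file into a directory of the file system.
adls2_hook.upload_file_to_directory(
    file_system_name="raw-data",
    directory_name="daily",
    file_name="report.csv",
    file_path="/tmp/report.csv",
    overwrite=True,
)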
- list_files_directory(file_system_name, directory_name)[source]¶
List files or directories under the specified directory of the file system.
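For example (the names are placeholders):

# List the paths under a directory.
paths = adls2_hook.list_files_directory(file_system_name="raw-data", directory_name="daily")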
- list_file_system(prefix=None, include_metadata=False, **kwargs)[source]¶
List the file systems under the specified account.
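For example (the prefix is a placeholder):

# List file systems in the account whose names start with the given prefix.
file_systems = adls2_hook.list_file_system(prefix="raw", include_metadata=True)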