airflow.providers.microsoft.azure.hooks.data_lake¶
Module Contents¶
Classes¶
AzureDataLakeHook | Integration with Azure Data Lake. |
AzureDataLakeStorageV2Hook | Interact with an ADLS gen2 storage account. |
Attributes¶
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook(azure_data_lake_conn_id=default_conn_name)[source]¶
Bases:
airflow.hooks.base.BaseHook
Integration with Azure Data Lake.
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), a password (=Client Secret), and the extra fields tenant (Tenant) and account_name (Account Name). See the connection azure_data_lake_default for an example.
Client ID and secret should be in the login and password parameters. Tenant and account name should be in the extra field as {"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}.
- Parameters
azure_data_lake_conn_id (str) – Reference to Azure Data Lake connection.
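As a sketch of the connection layout described above, the snippet below collects the fields in a plain dictionary and serializes them as JSON. The helper name build_adl_connection and every credential value are hypothetical placeholders. Supplying the resulting JSON to Airflow (for example through an AIRFLOW_CONN_AZURE_DATA_LAKE_DEFAULT environment variable) assumes an Airflow version that accepts JSON-serialized connections.

```python
import json


def build_adl_connection(client_id, client_secret, tenant, account_name):
    """Return the fields of an azure_data_lake connection as a dict."""
    return {
        "conn_type": "azure_data_lake",
        "login": client_id,         # Client ID
        "password": client_secret,  # Client Secret
        # Tenant and account name travel in the extra field, stored as JSON text.
        "extra": json.dumps({"tenant": tenant, "account_name": account_name}),
    }


# Placeholder values only; real credentials belong in a secrets backend.
conn = build_adl_connection("<CLIENT_ID>", "<CLIENT_SECRET>", "<TENANT>", "<ACCOUNT_NAME>")
conn_json = json.dumps(conn)
print(conn_json)
```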
- upload_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]¶
Upload a file to Azure Data Lake.
- Parameters
local_path (str) – Local path. Can be a single file, a directory (in which case it is uploaded recursively), or a glob pattern. Recursive glob patterns using ** are not supported.
remote_path (str) – Remote path to upload to; if multiple files, this is the directory root to write within.
nthreads (int) – Number of threads to use. If None, uses the number of cores.
overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the call quits regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
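A minimal sketch of calling upload_file with explicit arguments. The wrapper upload_to_adl is a hypothetical helper, not part of the provider; it expects an already constructed AzureDataLakeHook (for example AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")) and checks the documented buffersize/blocksize relationship before delegating.

```python
def upload_to_adl(hook, local_path, remote_path, buffersize=2**22, blocksize=2**22):
    """Upload local_path through hook.upload_file after a basic size check.

    `hook` is assumed to be an AzureDataLakeHook instance.
    """
    # The buffer may not be smaller than a block, so fail early on a bad pair.
    if buffersize < blocksize:
        raise ValueError("buffersize must be at least as large as blocksize")
    hook.upload_file(
        local_path=local_path,
        remote_path=remote_path,
        nthreads=8,       # illustrative value, lower than the default of 64
        overwrite=False,  # do not replace existing remote files
        buffersize=buffersize,
        blocksize=blocksize,
    )
```

Passing overwrite=False here is a deliberate choice for the sketch; the method itself defaults to overwrite=True.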
- download_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]¶
Download a file from Azure Data Lake.
- Parameters
local_path (str) – Local path. If downloading a single file, writes to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Directories are created as required.
remote_path (str) – Remote path/globstring to use to find remote files. Recursive glob patterns using ** are not supported.
nthreads (int) – Number of threads to use. If None, uses the number of cores.
overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and remote path is a directory, the call quits regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
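The download parameters can be exercised in the same way. The sketch below uses a hypothetical wrapper, download_matching, which rejects recursive ** patterns (documented above as unsupported) before calling download_file on an AzureDataLakeHook instance supplied by the caller.

```python
def download_matching(hook, remote_glob, local_dir):
    """Download every remote file matching remote_glob into local_dir.

    `hook` is assumed to be an AzureDataLakeHook instance.
    """
    # Recursive patterns are documented as unsupported, so reject them up front.
    if "**" in remote_glob:
        raise ValueError("recursive glob patterns using ** are not supported")
    hook.download_file(
        local_path=local_dir,    # root directory when several files match
        remote_path=remote_glob,
        overwrite=True,
    )
```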
- list(path)[source]¶
List files in Azure Data Lake Storage.
- Parameters
path (str) – full path/globstring to use to list files in ADLS
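A short sketch of listing with a globstring. The helper list_by_suffix is hypothetical, and it assumes that list returns a list of path strings, which this reference does not state explicitly.

```python
def list_by_suffix(hook, directory, suffix):
    """Return the sorted paths directly under directory that end with suffix.

    `hook` is assumed to be an AzureDataLakeHook instance whose list()
    returns path strings.
    """
    # Build a non-recursive glob such as "raw/*.csv".
    pattern = f"{directory.rstrip('/')}/*{suffix}"
    return sorted(hook.list(pattern))
```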
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeStorageV2Hook(adls_conn_id, public_read=False)[source]¶
Bases:
airflow.hooks.base.BaseHook
Interact with an ADLS gen2 storage account.
It mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace. It uses the Adls_v2 connection details to create a DataLakeServiceClient object.
Because Wasb is marked as legacy and ADLS1 is being retired, this hook provides an ADLS gen2 interface for interacting with the storage account.
See also
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python
- Parameters
adls_conn_id (str) – Reference to the adls connection.
public_read (bool) – Whether anonymous public read access should be used. Defaults to False.
- classmethod get_connection_form_widgets()[source]¶
Returns connection widgets to add to connection form.
- create_file_system(file_system_name)[source]¶
Create a new file system under the specified account.
A container acts as a file system for your files.
If a file system with the same name already exists, a ResourceExistsError will be raised. This method returns a client with which to interact with the newly created file system.
- get_file_system(file_system)[source]¶
Get a client to interact with the specified file system.
- Parameters
file_system (azure.storage.filedatalake.FileSystemProperties | str) – This can either be the name of the file system or an instance of FileSystemProperties.
- create_directory(file_system_name, directory_name, **kwargs)[source]¶
Create a directory under the specified file system.
- get_directory_client(file_system_name, directory_name)[source]¶
Get the specific directory under the specified file system.
- Parameters
file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.
directory_name (azure.storage.filedatalake.DirectoryProperties | str) – Name of the directory or instance of DirectoryProperties which needs to be retrieved from the file system.
- upload_file(file_system_name, file_name, file_path, overwrite=False, **kwargs)[source]¶
Create a file with data in the file system.
- Parameters
file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.
file_name (str) – Name of the file to be created.
file_path (str) – Path to the file to load.
overwrite (bool) – Boolean flag to overwrite an existing file or not.
- upload_file_to_directory(file_system_name, directory_name, file_name, file_path, overwrite=False, **kwargs)[source]¶
Upload data to a file.
- Parameters
file_system_name (str) – Name of the file system or instance of FileSystemProperties.
directory_name (str) – Name of the directory.
file_name (str) – Name of the file to be created.
file_path (str) – Path to the file to load.
overwrite (bool) – Boolean flag to overwrite an existing file or not.
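The directory and upload methods can be combined as in the sketch below. The helper publish_file is hypothetical and expects an AzureDataLakeStorageV2Hook instance built by the caller, for example AzureDataLakeStorageV2Hook(adls_conn_id="<CONN_ID>"). Whether create_directory tolerates a directory that already exists is not stated in this reference, so the sketch makes no claim about that case.

```python
import os


def publish_file(hook, file_system_name, directory_name, local_file):
    """Create directory_name, then upload local_file into it.

    `hook` is assumed to be an AzureDataLakeStorageV2Hook instance.
    Returns the remote file name that was used.
    """
    # Reuse the local base name as the remote file name.
    file_name = os.path.basename(local_file)
    hook.create_directory(file_system_name, directory_name)
    hook.upload_file_to_directory(
        file_system_name,
        directory_name,
        file_name,
        local_file,
        overwrite=True,  # the method defaults to False
    )
    return file_name
```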
- list_files_directory(file_system_name, directory_name)[source]¶
List files or directories under the specified file system.
- list_file_system(prefix=None, include_metadata=False, **kwargs)[source]¶
List file systems under the specified account.
- delete_file_system(file_system_name)[source]¶
Delete the file system.
- Parameters
file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.
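Listing and deletion can be paired for cleanup, as sketched below. The helper delete_file_systems_with_prefix is hypothetical, and it assumes that list_file_system yields file system names as strings; verify this against the installed provider version before relying on it. The dry_run default is a safety choice of the sketch, not a feature of the hook.

```python
def delete_file_systems_with_prefix(hook, prefix, dry_run=True):
    """Delete every file system whose name starts with prefix.

    `hook` is assumed to be an AzureDataLakeStorageV2Hook instance.
    Returns the matching names; nothing is deleted while dry_run is True.
    """
    # An empty prefix would match every file system in the account.
    if not prefix:
        raise ValueError("refusing to run with an empty prefix")
    names = list(hook.list_file_system(prefix=prefix))
    if not dry_run:
        for name in names:
            hook.delete_file_system(name)
    return names
```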