airflow.providers.microsoft.azure.hooks.data_lake¶
Module Contents¶
Classes¶
| Class | Description |
| --- | --- |
| AzureDataLakeHook | Integration with Azure Data Lake. |
| AzureDataLakeStorageV2Hook | Interact with an ADLS Gen2 storage account. |
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook(azure_data_lake_conn_id=default_conn_name)[source]¶
- Bases: airflow.hooks.base.BaseHook

  Integration with Azure Data Lake.

  AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client Secret), and extra fields tenant (Tenant) and account_name (Account Name). See connection azure_data_lake_default for an example.

  Client ID and secret should be in the user and password parameters. Tenant and account name should be in the extra field as {"tenant": "<TENANT>", "account_name": "ACCOUNT_NAME"}.

  - Parameters
- azure_data_lake_conn_id (str) – Reference to Azure Data Lake connection. 
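
A minimal sketch of instantiating the hook, assuming a connection configured as described above; the connection ID and the listed path are illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

# "azure_data_lake_default" is the default connection ID; pass another ID if needed.
hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")

# List remote files matching a glob (the path is illustrative).
print(hook.list("raw/events/2024/*.json"))
```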
 - upload_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]¶
- Upload a file to Azure Data Lake.

  - Parameters
- local_path (str) – local path. Can be single file, directory (in which case, upload recursively) or glob pattern. Recursive glob patterns using ** are not supported. 
- remote_path (str) – Remote path to upload to; if multiple files, this is the directory root to write within. 
- nthreads (int) – Number of threads to use. If None, uses the number of cores. 
- overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the operation quits regardless of whether any files would actually be overwritten. If True, only matching filenames are actually overwritten. 
- buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block. 
- blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk. 
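
A short usage sketch for upload_file; the local and remote paths below are illustrative and assume the connection above exists:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")

# Upload a single local file into a remote directory root (paths are illustrative).
hook.upload_file(
    local_path="/tmp/report.csv",
    remote_path="analytics/reports",
    nthreads=64,
    overwrite=True,
)
```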
 
 
 - download_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]¶
- Download a file from Azure Data Lake.

  - Parameters
- local_path (str) – local path. If downloading a single file, will write to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Will create directories as required. 
- remote_path (str) – remote path/globstring to use to find remote files. Recursive glob patterns using ** are not supported. 
- nthreads (int) – Number of threads to use. If None, uses the number of cores. 
- overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the operation quits regardless of whether any files would actually be overwritten. If True, only matching filenames are actually overwritten. 
- buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block. 
- blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk. 
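
A short usage sketch for download_file, mirroring the upload example; paths are illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")

# Download all files matching the remote glob into a local directory;
# missing local directories are created as required (paths are illustrative).
hook.download_file(
    local_path="/tmp/reports",
    remote_path="analytics/reports/*.csv",
    overwrite=True,
)
```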
 
 
 - list(path)[source]¶
- List files in Azure Data Lake Storage.

  - Parameters
- path (str) – full path/globstring to use to list files in ADLS 
 
 
- class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeStorageV2Hook(adls_conn_id, public_read=False)[source]¶
- Bases: airflow.hooks.base.BaseHook

  Interact with an ADLS Gen2 storage account.

  This hook mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace. It uses the ADLS Gen2 connection details to create a DataLakeServiceClient object.

  Because the WASB interface is marked as legacy and ADLS Gen1 is being retired, this hook provides ADLS Gen2 support for interacting with the storage account.

  See also: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python

  - Parameters
- adls_conn_id (str) – Reference to the adls connection. 
- public_read (bool) – Whether anonymous public read access should be used. Default is False. 
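
A minimal sketch of instantiating the Gen2 hook; the connection ID is illustrative and must reference an existing ADLS connection:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

# The connection ID is illustrative; point it at an ADLS Gen2 connection.
hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default", public_read=False)

# List the file systems (containers) visible in the account.
for file_system in hook.list_file_system():
    print(file_system)
```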
 
 - create_file_system(file_system_name)[source]¶
- Create a new file system under the specified account.

  A container acts as a file system for your files.

  If the file system with the same name already exists, a ResourceExistsError will be raised. This method returns a client with which to interact with the newly created file system.
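
A short sketch pairing create_file_system with get_file_system described below; the file system name is illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")

# Create a container-level file system (the name is illustrative).
hook.create_file_system("analytics")

# Obtain a client for the file system, e.g. to inspect it afterwards.
fs_client = hook.get_file_system("analytics")
```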
 - get_file_system(file_system)[source]¶
- Get a client to interact with the specified file system.

  - Parameters
- file_system (azure.storage.filedatalake.FileSystemProperties | str) – This can either be the name of the file system or an instance of FileSystemProperties. 
 
 - create_directory(file_system_name, directory_name, **kwargs)[source]¶
- Create a directory under the specified file system. 
 - get_directory_client(file_system_name, directory_name)[source]¶
- Get the specific directory under the specified file system. - Parameters
- file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties. 
- directory_name (azure.storage.filedatalake.DirectoryProperties | str) – Name of the directory or instance of DirectoryProperties which needs to be retrieved from the file system. 
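
A short sketch combining create_directory and get_directory_client; the file system and directory names are illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")

# Create a directory inside an existing file system (names are illustrative).
hook.create_directory(file_system_name="analytics", directory_name="raw/2024")

# Retrieve a client for that directory for further operations.
directory_client = hook.get_directory_client("analytics", directory_name="raw/2024")
```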
 
 
 - upload_file(file_system_name, file_name, file_path, overwrite=False, **kwargs)[source]¶
- Create a file with data in the file system.

  - Parameters
- file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties. 
- file_name (str) – Name of the file to be created. 
- file_path (str) – Path to the file to load. 
- overwrite (bool) – Boolean flag to overwrite an existing file or not. 
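
A short usage sketch for upload_file; the file system name and paths are illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")

# Upload a local file to the root of the file system (names and paths are illustrative).
hook.upload_file(
    file_system_name="analytics",
    file_name="report.csv",
    file_path="/tmp/report.csv",
    overwrite=True,
)
```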
 
 
 - upload_file_to_directory(file_system_name, directory_name, file_name, file_path, overwrite=False, **kwargs)[source]¶
- Upload data to a file.

  - Parameters
- file_system_name (str) – Name of the file system or instance of FileSystemProperties. 
- directory_name (str) – Name of the directory. 
- file_name (str) – Name of the file to be created. 
- file_path (str) – Path to the file to load. 
- overwrite (bool) – Boolean flag to overwrite an existing file or not. 
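
A short usage sketch for upload_file_to_directory; names and paths are illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")

# Upload a local file into a specific directory of the file system
# (names and paths are illustrative).
hook.upload_file_to_directory(
    file_system_name="analytics",
    directory_name="raw/2024",
    file_name="report.csv",
    file_path="/tmp/report.csv",
    overwrite=True,
)
```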
 
 
 - list_files_directory(file_system_name, directory_name)[source]¶
- List files and directories under the specified directory of the file system. 
 - list_file_system(prefix=None, include_metadata=False, **kwargs)[source]¶
- List file systems under the specified account. 
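
A short sketch for the two listing helpers above; the prefix, file system, and directory names are illustrative:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")

# File systems in the account whose names start with a given prefix.
file_systems = hook.list_file_system(prefix="analytics")

# Files and directories under one directory of a file system.
entries = hook.list_files_directory("analytics", directory_name="raw/2024")
```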
 - delete_file_system(file_system_name)[source]¶
- Delete the file system.

  - Parameters
- file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties. 
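
A short usage sketch for delete_file_system; the name is illustrative, and deleting a file system removes everything stored in it:

```python
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook

hook = AzureDataLakeStorageV2Hook(adls_conn_id="adls_default")

# Delete the whole file system (the name is illustrative).
hook.delete_file_system("analytics")
```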
 
 
