airflow.providers.microsoft.azure.hooks.data_lake

Module Contents

Classes

AzureDataLakeHook

Integration with Azure Data Lake.

AzureDataLakeStorageV2Hook

Interact with a ADLS gen2 storage account.

Attributes

Credentials

airflow.providers.microsoft.azure.hooks.data_lake.Credentials[source]
class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook(azure_data_lake_conn_id=default_conn_name)[source]

Bases: airflow.hooks.base.BaseHook

Integration with Azure Data Lake.

AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that a Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client Secret), and extra fields tenant (Tenant) and account_name (Account Name). See connection azure_data_lake_default for an example.

Client ID and secret should be in user and password parameters. Tenant and account name should be extra field as {"tenant": "<TENANT>", "account_name": "ACCOUNT_NAME"}.

Parameters

azure_data_lake_conn_id (str) – Reference to Azure Data Lake connection.

conn_name_attr = 'azure_data_lake_conn_id'[source]
default_conn_name = 'azure_data_lake_default'[source]
conn_type = 'azure_data_lake'[source]
hook_name = 'Azure Data Lake'[source]
classmethod get_connection_form_widgets()[source]

Returns connection widgets to add to connection form.

classmethod get_ui_field_behaviour()[source]

Returns custom field behaviour.

get_conn()[source]

Return a AzureDLFileSystem object.

check_for_file(file_path)[source]

Check if a file exists on Azure Data Lake.

Parameters

file_path (str) – Path and name of the file.

Returns

True if the file exists, False otherwise.

Return type

bool

upload_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]

Upload a file to Azure Data Lake.

Parameters
  • local_path (str) – local path. Can be single file, directory (in which case, upload recursively) or glob pattern. Recursive glob patterns using ** are not supported.

  • remote_path (str) – Remote path to upload to; if multiple files, this is the directory root to write within.

  • nthreads (int) – Number of threads to use. If None, uses the number of cores.

  • overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and remote path is a directory, will quit regardless if any files would be overwritten or not. If True, only matching filenames are actually overwritten.

  • buffersize (int) – int [2**22] Number of bytes for internal buffer. This block cannot be bigger than a chunk and cannot be smaller than a block.

  • blocksize (int) – int [2**22] Number of bytes for a block. Within each chunk, we write a smaller block for each API call. This block cannot be bigger than a chunk.

download_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]

Download a file from Azure Blob Storage.

Parameters
  • local_path (str) – local path. If downloading a single file, will write to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Will create directories as required.

  • remote_path (str) – remote path/globstring to use to find remote files. Recursive glob patterns using ** are not supported.

  • nthreads (int) – Number of threads to use. If None, uses the number of cores.

  • overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and remote path is a directory, will quit regardless if any files would be overwritten or not. If True, only matching filenames are actually overwritten.

  • buffersize (int) – int [2**22] Number of bytes for internal buffer. This block cannot be bigger than a chunk and cannot be smaller than a block.

  • blocksize (int) – int [2**22] Number of bytes for a block. Within each chunk, we write a smaller block for each API call. This block cannot be bigger than a chunk.

list(path)[source]

List files in Azure Data Lake Storage.

Parameters

path (str) – full path/globstring to use to list files in ADLS

remove(path, recursive=False, ignore_not_found=True)[source]

Remove files in Azure Data Lake Storage.

Parameters
  • path (str) – A directory or file to remove in ADLS

  • recursive (bool) – Whether to loop into directories in the location and remove the files

  • ignore_not_found (bool) – Whether to raise error if file to delete is not found

class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeStorageV2Hook(adls_conn_id, public_read=False)[source]

Bases: airflow.hooks.base.BaseHook

Interact with a ADLS gen2 storage account.

It mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace. Using Adls_v2 connection details create DataLakeServiceClient object.

Due to Wasb is marked as legacy and retirement of the (ADLS1), it would be nice to implement ADLS gen2 hook for interacting with the storage account.

Parameters
  • adls_conn_id (str) – Reference to the adls connection.

  • public_read (bool) – Whether an anonymous public read access should be used. default is False

conn_name_attr = 'adls_conn_id'[source]
default_conn_name = 'adls_default'[source]
conn_type = 'adls'[source]
hook_name = 'Azure Date Lake Storage V2'[source]
classmethod get_connection_form_widgets()[source]

Returns connection widgets to add to connection form.

classmethod get_ui_field_behaviour()[source]

Returns custom field behaviour.

service_client()[source]

Return the DataLakeServiceClient object (cached).

get_conn()[source]

Return the DataLakeServiceClient object.

create_file_system(file_system_name)[source]

Create a new file system under the specified account.

A container acts as a file system for your files.

If the file system with the same name already exists, a ResourceExistsError will be raised. This method returns a client with which to interact with the newly created file system.

get_file_system(file_system)[source]

Get a client to interact with the specified file system.

Parameters

file_system (azure.storage.filedatalake.FileSystemProperties | str) – This can either be the name of the file system or an instance of FileSystemProperties.

create_directory(file_system_name, directory_name, **kwargs)[source]

Create a directory under the specified file system.

Parameters
  • file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory which needs to be created in the file system.

get_directory_client(file_system_name, directory_name)[source]

Get the specific directory under the specified file system.

Parameters
  • file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (azure.storage.filedatalake.DirectoryProperties | str) – Name of the directory or instance of DirectoryProperties which needs to be retrieved from the file system.

create_file(file_system_name, file_name)[source]

Create a file under the file system.

Parameters
  • file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • file_name (str) – Name of the file which needs to be created in the file system.

upload_file(file_system_name, file_name, file_path, overwrite=False, **kwargs)[source]

Create a file with data in the file system.

Parameters
  • file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • file_name (str) – Name of the file to be created with name.

  • file_path (str) – Path to the file to load.

  • overwrite (bool) – Boolean flag to overwrite an existing file or not.

upload_file_to_directory(file_system_name, directory_name, file_name, file_path, overwrite=False, **kwargs)[source]

Upload data to a file.

Parameters
  • file_system_name (str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory.

  • file_name (str) – Name of the file to be created with name.

  • file_path (str) – Path to the file to load.

  • overwrite (bool) – Boolean flag to overwrite an existing file or not.

list_files_directory(file_system_name, directory_name)[source]

List files or directories under the specified file system.

Parameters
  • file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory.

list_file_system(prefix=None, include_metadata=False, **kwargs)[source]

List file systems under the specified account.

Parameters
  • prefix (str | None) – Filters the results to return only file systems whose names begin with the specified prefix.

  • include_metadata (bool) – Specifies that file system metadata be returned in the response. The default value is False.

delete_file_system(file_system_name)[source]

Delete the file system.

Parameters

file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

delete_directory(file_system_name, directory_name)[source]

Delete the specified directory in a file system.

Parameters
  • file_system_name (azure.storage.filedatalake.FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory.

test_connection()[source]

Test ADLS Gen2 Storage connection.

Was this entry helpful?