airflow.providers.microsoft.azure.hooks.data_lake

Module Contents

Classes

AzureDataLakeHook

This module contains integration with Azure Data Lake.

AzureDataLakeStorageV2Hook

Interacts with an ADLS Gen2 storage account; it mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace.

class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook(azure_data_lake_conn_id=default_conn_name)[source]

Bases: airflow.hooks.base.BaseHook

This module contains integration with Azure Data Lake.

AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (= Client ID), a password (= Client Secret), and the extra fields tenant (Tenant) and account_name (Account Name); see the connection azure_data_lake_default for an example.

Interacts with Azure Data Lake.

The client ID and client secret should be supplied in the login and password parameters. The tenant and account name should be supplied in the extra field as {"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}.

Parameters

azure_data_lake_conn_id (str) – Reference to the Azure Data Lake connection.
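As a rough illustration of the connection layout described above, the sketch below assembles the fields of such a connection and wraps hook construction in a small factory. All credential values are placeholders, and make_hook is a hypothetical helper name, not part of the provider. The import is done lazily so the snippet loads even where the provider package is not installed.

```python
import json

# Placeholder credentials (hypothetical values); substitute your own.
connection_fields = {
    "conn_id": "azure_data_lake_default",
    "conn_type": "azure_data_lake",
    "login": "<CLIENT_ID>",  # Client ID
    "password": "<CLIENT_SECRET>",  # Client Secret
    "extra": json.dumps({"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}),
}


def make_hook(conn_id="azure_data_lake_default"):
    """Build an AzureDataLakeHook for an existing Airflow connection (hypothetical helper)."""
    # Lazy import: needs apache-airflow-providers-microsoft-azure only when called.
    from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

    return AzureDataLakeHook(azure_data_lake_conn_id=conn_id)
```

Calling make_hook() only succeeds if a connection with the given ID is actually registered in Airflow.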

conn_name_attr = 'azure_data_lake_conn_id'[source]
default_conn_name = 'azure_data_lake_default'[source]
conn_type = 'azure_data_lake'[source]
hook_name = 'Azure Data Lake'[source]
static get_connection_form_widgets()[source]

Returns connection widgets to add to connection form

static get_ui_field_behaviour()[source]

Returns custom field behaviour

get_conn()[source]

Return an AzureDLFileSystem object.

check_for_file(file_path)[source]

Check if a file exists on Azure Data Lake.

Parameters

file_path (str) – Path and name of the file.

Returns

True if the file exists, False otherwise.

Return type

bool

upload_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]

Upload a file to Azure Data Lake.

Parameters
  • local_path (str) – Local path. Can be a single file, a directory (in which case the upload is recursive), or a glob pattern. Recursive glob patterns using ** are not supported.

  • remote_path (str) – Remote path to upload to; if multiple files, this is the directory root to write within.

  • nthreads (int) – Number of threads to use. If None, uses the number of cores.

  • overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the upload quits regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.

  • buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.

  • blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
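A minimal sketch of how check_for_file and upload_file might be combined to avoid re-uploading a file that is already present. The helper name upload_if_missing is hypothetical; hook is assumed to be an AzureDataLakeHook instance (or any object exposing the same two methods).

```python
def upload_if_missing(hook, local_path, remote_path):
    """Upload local_path to remote_path unless the remote file already exists.

    Returns True if an upload was started, False if it was skipped.
    """
    if hook.check_for_file(remote_path):
        # The remote file exists, so nothing is uploaded.
        return False
    hook.upload_file(local_path=local_path, remote_path=remote_path)
    return True
```

The existence check and the upload are two separate calls, so another writer could create the file in between; treat this as a convenience, not a lock.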

download_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)[source]

Download a file from Azure Data Lake.

Parameters
  • local_path (str) – local path. If downloading a single file, will write to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Will create directories as required.

  • remote_path (str) – remote path/globstring to use to find remote files. Recursive glob patterns using ** are not supported.

  • nthreads (int) – Number of threads to use. If None, uses the number of cores.

  • overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and remote path is a directory, the download quits regardless of whether any files would be overwritten. If True, only matching filenames are actually overwritten.

  • buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.

  • blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
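The size constraints above can be checked before a transfer is started. The sketch below is one possible guard, assuming hook is an AzureDataLakeHook; download_checked is a hypothetical helper, and it only checks the buffer-versus-block rule because the chunk size is not exposed in the signature.

```python
def download_checked(hook, local_path, remote_path, buffersize=2**22, blocksize=2**22):
    """Call hook.download_file after checking that the buffer is not smaller than a block."""
    if blocksize > buffersize:
        raise ValueError("buffersize must be at least as large as blocksize")
    hook.download_file(
        local_path=local_path,
        remote_path=remote_path,
        buffersize=buffersize,
        blocksize=blocksize,
    )
```

The default of 2**22 bytes equals the 4194304 shown in the method signature.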

list(path)[source]

List files in Azure Data Lake Storage

Parameters

path (str) – full path/globstring to use to list files in ADLS

remove(path, recursive=False, ignore_not_found=True)[source]

Remove files in Azure Data Lake Storage

Parameters
  • path (str) – A directory or file to remove in ADLS

  • recursive (bool) – Whether to loop into directories in the location and remove the files

  • ignore_not_found (bool) – Whether to ignore a missing file. If False, an error is raised when the file to delete is not found.
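As an illustration of list and remove working together, the sketch below deletes every path that matches a globstring. It assumes that list returns an iterable of path strings, which this page does not state explicitly; remove_matching is a hypothetical helper name.

```python
def remove_matching(hook, globstring, recursive=False):
    """Remove every path returned by hook.list(globstring) and return the removed paths."""
    removed = []
    for path in hook.list(globstring):
        # ignore_not_found=True tolerates paths that vanish between list and remove.
        hook.remove(path, recursive=recursive, ignore_not_found=True)
        removed.append(path)
    return removed
```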

class airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeStorageV2Hook(adls_conn_id, public_read=False)[source]

Bases: airflow.hooks.base.BaseHook

This hook interacts with an ADLS Gen2 storage account. It mainly helps to create and manage directories and files in storage accounts that have a hierarchical namespace. It uses the ADLS v2 connection details to create a DataLakeServiceClient object.

Because Wasb is marked as legacy and ADLS Gen1 (ADLS1) is being retired, this hook provides an ADLS Gen2 interface for interacting with the storage account.

Parameters
  • adls_conn_id (str) – Reference to the adls connection.

  • public_read (bool) – Whether anonymous public read access should be used. Defaults to False.

conn_name_attr = 'adls_conn_id'[source]
default_conn_name = 'adls_default'[source]
conn_type = 'adls'[source]
hook_name = 'Azure Date Lake Storage V2'[source]
static get_connection_form_widgets()[source]

Returns connection widgets to add to connection form

static get_ui_field_behaviour()[source]

Returns custom field behaviour

get_conn()[source]

Return the DataLakeServiceClient object.

create_file_system(file_system_name)[source]

A container acts as a file system for your files. Creates a new file system under the specified account.

If the file system with the same name already exists, a ResourceExistsError will be raised. This method returns a client with which to interact with the newly created file system.
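Since creating a file system that already exists raises an error, a caller may prefer to check first. The sketch below does so with list_file_system (documented further down this page). It assumes the returned items are either name strings or objects with a name attribute, and ensure_file_system is a hypothetical helper name.

```python
def ensure_file_system(hook, file_system_name):
    """Create the file system only if no file system with that exact name is listed.

    Returns True if create_file_system was called, False otherwise.
    """
    existing = hook.list_file_system(prefix=file_system_name)
    # Accept both plain names and objects that carry a .name attribute.
    names = {getattr(item, "name", item) for item in existing}
    if file_system_name in names:
        return False
    hook.create_file_system(file_system_name)
    return True
```

The check and the creation are not atomic, so a concurrent creator can still trigger ResourceExistsError; callers that need strict behaviour should also handle that exception.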

get_file_system(file_system)[source]

Get a client to interact with the specified file system

Parameters

file_system (FileSystemProperties | str) – This can either be the name of the file system or an instance of FileSystemProperties.

create_directory(file_system_name, directory_name, **kwargs)[source]

Create a directory under the specified file system.

Parameters
  • file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory which needs to be created in the file system.

get_directory_client(file_system_name, directory_name)[source]

Get the specific directory under the specified file system.

Parameters
  • file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (DirectoryProperties | str) – Name of the directory or instance of DirectoryProperties which needs to be retrieved from the file system.
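The two directory methods above can be combined into a create-if-absent helper. This sketch assumes that the client returned by get_directory_client exposes an exists() method, as the directory client in the Azure SDK is expected to; that call and the helper name ensure_directory are assumptions, not something this page documents.

```python
def ensure_directory(hook, file_system_name, directory_name):
    """Create directory_name in the file system unless it already exists.

    Returns True if create_directory was called, False otherwise.
    """
    client = hook.get_directory_client(file_system_name, directory_name)
    # Assumption: the returned directory client provides exists().
    if client.exists():
        return False
    hook.create_directory(file_system_name, directory_name)
    return True
```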

create_file(file_system_name, file_name)[source]

Creates a file under the file system

Parameters
  • file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • file_name (str) – Name of the file which needs to be created in the file system.

upload_file(file_system_name, file_name, file_path, overwrite=False, **kwargs)[source]

Create a file with data in the file system

Parameters
  • file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • file_name (str) – Name of the file to be created.

  • file_path (str) – Path to the file to load.

  • overwrite (bool) – Boolean flag to overwrite an existing file or not.
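A small sketch of calling upload_file with the remote file name derived from the local path. The helper upload_local_file is hypothetical, and hook is assumed to be an AzureDataLakeStorageV2Hook instance.

```python
import os


def upload_local_file(hook, file_system_name, local_path, overwrite=False):
    """Upload local_path under its own base name and return that name."""
    file_name = os.path.basename(local_path)
    hook.upload_file(
        file_system_name=file_system_name,
        file_name=file_name,
        file_path=local_path,
        overwrite=overwrite,
    )
    return file_name
```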

upload_file_to_directory(file_system_name, directory_name, file_name, file_path, overwrite=False, **kwargs)[source]

Create a new file in the specified directory and upload data to it.

Parameters
  • file_system_name (str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory.

  • file_name (str) – Name of the file to be created.

  • file_path (str) – Path to the file to load.

  • overwrite (bool) – Boolean flag to overwrite an existing file or not.
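To upload several files, upload_file_to_directory can be called once per file. The sketch below walks the top level of a local directory; upload_directory is a hypothetical helper, and subdirectories are deliberately skipped to keep the example short.

```python
from pathlib import Path


def upload_directory(hook, file_system_name, directory_name, local_dir, overwrite=False):
    """Upload each regular file directly inside local_dir; return the uploaded names."""
    uploaded = []
    for entry in sorted(Path(local_dir).iterdir()):
        if not entry.is_file():
            # Nested directories are not handled in this sketch.
            continue
        hook.upload_file_to_directory(
            file_system_name=file_system_name,
            directory_name=directory_name,
            file_name=entry.name,
            file_path=str(entry),
            overwrite=overwrite,
        )
        uploaded.append(entry.name)
    return uploaded
```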

list_files_directory(file_system_name, directory_name)[source]

Get the list of files or directories under the specified file system

Parameters
  • file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory.

list_file_system(prefix=None, include_metadata=False, **kwargs)[source]

Get the list of file systems under the specified account.

Parameters
  • prefix (str | None) – Filters the results to return only file systems whose names begin with the specified prefix.

  • include_metadata (bool) – Specifies that file system metadata be returned in the response. The default value is False.

delete_file_system(file_system_name)[source]

Deletes the file system

Parameters

file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

delete_directory(file_system_name, directory_name)[source]

Deletes specified directory in file system

Parameters
  • file_system_name (FileSystemProperties | str) – Name of the file system or instance of FileSystemProperties.

  • directory_name (str) – Name of the directory.
