airflow.contrib.hooks.azure_data_lake_hook
Module Contents
-
class airflow.contrib.hooks.azure_data_lake_hook.AzureDataLakeHook(azure_data_lake_conn_id='azure_data_lake_default')
Bases: airflow.hooks.base_hook.BaseHook
Interacts with Azure Data Lake.
The client ID and client secret should be supplied in the user (login) and password fields of the connection. The tenant and account name should be supplied in the extra field as {"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}.
- Parameters
azure_data_lake_conn_id (str) – Reference to the Azure Data Lake connection.
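As a minimal sketch of the extra field described above, the JSON value can be built with the standard library. The helper name `build_adl_extra` is hypothetical and is not part of Airflow; only the two keys, `tenant` and `account_name`, come from the description above.

```python
import json


def build_adl_extra(tenant, account_name):
    """Return a JSON string suitable for the connection's extra field.

    Hypothetical helper for illustration; only the key names are taken
    from the hook's documentation.
    """
    return json.dumps({"tenant": tenant, "account_name": account_name})


# Placeholder values; substitute the real tenant and account name.
extra = build_adl_extra("<TENANT>", "<ACCOUNT_NAME>")
print(extra)
```

The resulting string would be pasted into the extra field of the connection referenced by azure_data_lake_conn_id, with the client ID and client secret stored in the login and password fields.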
-
upload_file(self, local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)
Upload a file to Azure Data Lake.
- Parameters
local_path (str) – Local path. Can be a single file, a directory (in which case it is uploaded recursively), or a glob pattern. Recursive glob patterns using ** are not supported.
remote_path (str) – Remote path to upload to; if multiple files, this is the directory root to write within.
nthreads (int) – Number of threads to use. If None, uses the number of cores.
overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the remote path is a directory, the upload quits whether or not any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
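The following sketch shows one way upload_file might be called with the parameters listed above. The wrapper `upload_exports`, its default thread count, and the example paths are assumptions made for illustration; the wrapper only rejects recursive ** patterns (which the documentation says are unsupported) and then forwards the arguments to the hook.

```python
def upload_exports(hook, local_glob, remote_dir, threads=16):
    """Upload files matching local_glob into remote_dir.

    Hypothetical wrapper: `hook` is expected to be an AzureDataLakeHook
    (or any object exposing the same upload_file method).
    """
    # Recursive glob patterns using ** are documented as unsupported.
    if "**" in local_glob:
        raise ValueError("recursive '**' glob patterns are not supported")

    hook.upload_file(
        local_path=local_glob,
        remote_path=remote_dir,   # directory root when several files match
        nthreads=threads,
        overwrite=True,           # only matching filenames are overwritten
        buffersize=2 ** 22,       # documented default, 4194304 bytes
        blocksize=2 ** 22,        # documented default, 4194304 bytes
    )


# Typical use inside a task (requires Airflow with the Azure Data Lake
# dependencies installed and a configured connection):
#
# from airflow.contrib.hooks.azure_data_lake_hook import AzureDataLakeHook
# hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
# upload_exports(hook, "/tmp/exports/*.csv", "landing/exports")
```

Passing the hook in as an argument keeps the wrapper easy to exercise with a stand-in object when no Azure credentials are available.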
-
download_file(self, local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)
Download a file from Azure Data Lake.
- Parameters
local_path (str) – Local path. If downloading a single file, the data is written to this specific file, unless it is an existing directory, in which case a file is created within it. If downloading multiple files, this is the root directory to write within. Directories are created as required.
remote_path (str) – Remote path or glob string used to find remote files. Recursive glob patterns using ** are not supported.
nthreads (int) – Number of threads to use. If None, uses the number of cores.
overwrite (bool) – Whether to forcibly overwrite existing files/directories. If False and the local path is a directory, the download quits whether or not any files would be overwritten. If True, only matching filenames are actually overwritten.
buffersize (int) – Number of bytes for the internal buffer (default 2**22). The buffer cannot be bigger than a chunk and cannot be smaller than a block.
blocksize (int) – Number of bytes for a block (default 2**22). Within each chunk, a smaller block is written for each API call. A block cannot be bigger than a chunk.
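A download call can be sketched in the same way. The wrapper `download_to_dir` and the example paths are hypothetical; the keyword arguments are the ones documented for download_file, with the default values written out explicitly.

```python
def download_to_dir(hook, remote_path, local_dir):
    """Download remote_path (a file or glob) into local_dir.

    Hypothetical wrapper: `hook` is expected to be an AzureDataLakeHook
    (or any object exposing the same download_file method).
    """
    hook.download_file(
        local_path=local_dir,     # root directory when several files match
        remote_path=remote_path,
        nthreads=64,              # documented default
        overwrite=True,           # documented default
        buffersize=4194304,       # documented default (2**22 bytes)
        blocksize=4194304,        # documented default (2**22 bytes)
    )
    return local_dir


# Typical use inside a task (requires Airflow with the Azure Data Lake
# dependencies installed and a configured connection):
#
# from airflow.contrib.hooks.azure_data_lake_hook import AzureDataLakeHook
# hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
# download_to_dir(hook, "landing/exports/*.csv", "/tmp/downloads")
```

Note that local_path comes first in the signature for both methods, so keyword arguments help avoid swapping the source and destination by accident.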