airflow.providers.apache.hdfs.hooks.webhdfs

Hook for Web HDFS.

Module Contents

Classes

WebHDFSHook

Interact with HDFS. This class is a wrapper around the hdfscli library.

Attributes

log

airflow.providers.apache.hdfs.hooks.webhdfs.log[source]
exception airflow.providers.apache.hdfs.hooks.webhdfs.AirflowWebHDFSHookException[source]

Bases: airflow.exceptions.AirflowException

Exception specific for WebHDFS hook.

class airflow.providers.apache.hdfs.hooks.webhdfs.WebHDFSHook(webhdfs_conn_id=default_conn_name, proxy_user=None)[source]

Bases: airflow.hooks.base.BaseHook

Interact with HDFS. This class is a wrapper around the hdfscli library.

Parameters
  • webhdfs_conn_id (str) – The connection id for the webhdfs client to connect to.

  • proxy_user (str | None) – The user used to authenticate.

conn_type = 'webhdfs'[source]
conn_name_attr = 'webhdfs_conn_id'[source]
default_conn_name = 'webhdfs_default'[source]
hook_name = 'Apache WebHDFS'[source]
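
A minimal usage sketch (the connection id webhdfs_default below is the provider's default; any configured WebHDFS connection id can be used instead):

    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

    # Create the hook against a configured WebHDFS connection; proxy_user is optional.
    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default", proxy_user=None)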
get_conn()[source]

Establish a connection depending on the security mode set via config or environment variable.

Returns

An hdfscli InsecureClient or KerberosClient object.

Return type

Any
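
A minimal sketch of working with the raw client returned by get_conn(), assuming a connection id of webhdfs_default; listing a directory here uses hdfscli's own list() method, not a method of the hook:

    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default")
    client = hook.get_conn()  # InsecureClient or KerberosClient, depending on security mode
    # hdfscli's client API is available directly, e.g. listing an HDFS directory.
    print(client.list("/tmp"))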

check_for_path(hdfs_path)[source]

Check for the existence of a path in HDFS by querying FileStatus.

Parameters

hdfs_path (str) – The path to check.

Returns

True if the path exists and False if not.

Return type

bool
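
A usage sketch; the path /data/input.csv is hypothetical:

    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default")
    # check_for_path returns a plain bool, so it can gate downstream logic directly.
    if hook.check_for_path("/data/input.csv"):
        print("path exists")
    else:
        print("path missing")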

load_file(source, destination, overwrite=True, parallelism=1, **kwargs)[source]

Upload a file to HDFS.

Parameters
  • source (str) – Local path to a file or folder. If it is a folder, all the files inside it will be uploaded. Note that this implies empty folders will not be created remotely.

  • destination (str) – Target HDFS path. If it already exists and is a directory, files will be uploaded inside.

  • overwrite (bool) – Overwrite any existing file or directory.

  • parallelism (int) – Number of threads to use for parallelization. A value of 0 (or negative) uses as many threads as there are files.

  • kwargs (Any) – Keyword arguments forwarded to hdfs.client.Client.upload().
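
A usage sketch; the local path /tmp/report.csv and the HDFS directory /data/reports are hypothetical:

    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default")
    # Upload a single local file into an existing HDFS directory; with
    # overwrite=True an existing remote file of the same name is replaced.
    hook.load_file(
        source="/tmp/report.csv",
        destination="/data/reports",
        overwrite=True,
        parallelism=1,
    )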

read_file(filename)[source]

Read a file from HDFS.

Parameters

filename (str) – The path of the file to read.

Returns

File content as raw bytes.

Return type

bytes
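
A usage sketch; the HDFS path /data/reports/report.csv is hypothetical, and the returned bytes are decoded explicitly before use:

    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default")
    content = hook.read_file("/data/reports/report.csv")
    # read_file returns raw bytes; decode them for text processing.
    text = content.decode("utf-8")
    print(text[:100])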
