airflow.hooks.webhdfs_hook

Module Contents

airflow.hooks.webhdfs_hook._kerberos_security_mode
airflow.hooks.webhdfs_hook.log
exception airflow.hooks.webhdfs_hook.AirflowWebHDFSHookException

Bases: airflow.exceptions.AirflowException

class airflow.hooks.webhdfs_hook.WebHDFSHook(webhdfs_conn_id='webhdfs_default', proxy_user=None)

Bases: airflow.hooks.base_hook.BaseHook

Interact with HDFS. This class is a wrapper around the hdfscli library.

get_conn(self)

Returns an hdfscli InsecureClient object.
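
As a rough illustration of how the returned client might be used, the sketch below lists a directory through get_conn(). The helper names list_directory and demo, and the "/tmp" path, are hypothetical and not part of Airflow; list() is a method of the hdfscli client rather than of the hook.

```python
def list_directory(hook, hdfs_path):
    """Return the sorted entry names found under hdfs_path.

    `hook` is expected to behave like a WebHDFSHook. This helper is a
    hypothetical convenience wrapper, not part of Airflow.
    """
    client = hook.get_conn()  # hdfscli client object
    return sorted(client.list(hdfs_path))


def demo():
    # Imported lazily so list_directory can be used without Airflow installed.
    from airflow.hooks.webhdfs_hook import WebHDFSHook

    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default")
    return list_directory(hook, "/tmp")  # "/tmp" is an illustrative path
```

Calling demo() requires an Airflow installation and a reachable connection configured under webhdfs_default.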

check_for_path(self, hdfs_path)

Check for the existence of a path in HDFS by querying FileStatus.
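
One way to use this check is as a guard before downstream work. The following is a minimal sketch, assuming the method's result can be treated as a truthy/falsy value; require_path is a hypothetical helper, not part of Airflow.

```python
def require_path(hook, hdfs_path):
    """Return hdfs_path if it exists in HDFS, otherwise raise.

    `hook` is expected to behave like a WebHDFSHook. This helper is
    hypothetical and only wraps check_for_path().
    """
    if not hook.check_for_path(hdfs_path):
        raise FileNotFoundError("HDFS path not found: {}".format(hdfs_path))
    return hdfs_path
```

With a real hook this might be called as require_path(WebHDFSHook(), "/data/input"), where the path is purely illustrative.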

load_file(self, source, destination, overwrite=True, parallelism=1, **kwargs)

Uploads a file to HDFS.

Parameters
  • source (str) – Local path to file or folder. If a folder, all the files inside of it will be uploaded (note that this implies that folders empty of files will not be created remotely).

  • destination (str) – Target HDFS path. If it already exists and is a directory, files will be uploaded inside.

  • overwrite (bool) – Overwrite any existing file or directory.

  • parallelism (int) – Number of threads to use for parallelization. A value of 0 (or negative) uses as many threads as there are files.

  • **kwargs – Keyword arguments forwarded to upload().
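
The parameters above can be combined with check_for_path() to avoid clobbering existing data. The sketch below is one possible pattern, not the hook's own behaviour; upload_if_missing is a hypothetical helper name.

```python
def upload_if_missing(hook, source, destination, parallelism=1):
    """Upload source to destination only when destination does not exist.

    Returns True if an upload was attempted, False if it was skipped.
    `hook` is expected to behave like a WebHDFSHook; this helper is
    hypothetical and not part of Airflow.
    """
    if hook.check_for_path(destination):
        # Destination already present: leave it untouched.
        return False
    # overwrite=False as an extra safeguard if the path appears in between.
    hook.load_file(source, destination, overwrite=False, parallelism=parallelism)
    return True
```

Note that this sketch also skips the upload when the destination is an existing directory, whereas load_file() itself would place files inside it. Adjust the check if that behaviour is wanted.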