airflow.providers.apache.hive.hooks.hive

Module Contents
airflow.providers.apache.hive.hooks.hive.HIVE_QUEUE_PRIORITIES = ['VERY_HIGH', 'HIGH', 'NORMAL', 'LOW', 'VERY_LOW']
airflow.providers.apache.hive.hooks.hive.get_context_from_env_var() → Dict[Any, Any]

Extract context from environment variables, e.g. dag_id, task_id and execution_date, so that they can be used inside BashOperator and PythonOperator.

- Returns
The context of interest.
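For illustration, a minimal sketch of calling this helper from inside a task process launched by Airflow (the specific keys read below are assumptions based on the description above):

    from airflow.providers.apache.hive.hooks.hive import get_context_from_env_var

    # Airflow exports context as environment variables to task processes;
    # this helper collects them into a plain dict (keys shown are assumptions).
    context = get_context_from_env_var()
    print(context.get("dag_id"), context.get("task_id"), context.get("execution_date"))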
class airflow.providers.apache.hive.hooks.hive.HiveCliHook(hive_cli_conn_id: str = default_conn_name, run_as: Optional[str] = None, mapred_queue: Optional[str] = None, mapred_queue_priority: Optional[str] = None, mapred_job_name: Optional[str] = None)

Bases: airflow.hooks.base.BaseHook

Simple wrapper around the hive CLI.

It also supports beeline, a lighter CLI that uses JDBC and is replacing the heavier traditional CLI. To enable beeline, set the use_beeline param in the extra field of your connection, as in { "use_beeline": true }.

Note that you can also set default hive CLI parameters by adding hive_cli_params to the extra field of your connection, as in {"hive_cli_params": "-hiveconf mapred.job.tracker=some.jobtracker:444"}. Parameters passed here can be overridden by run_cli's hive_conf param.

The extra connection parameter auth gets passed as-is in the jdbc connection string.

- Parameters
mapred_queue (str) -- queue used by the Hadoop Scheduler (Capacity or Fair)
mapred_queue_priority (str) -- priority within the job queue. Possible settings include: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
mapred_job_name (str) -- This name will appear in the jobtracker. This can make monitoring easier.
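A minimal usage sketch (the connection id, queue name and job name below are placeholder assumptions, not values required by the module):

    from airflow.providers.apache.hive.hooks.hive import HiveCliHook

    # "hive_cli_default" is assumed to be an existing Hive CLI connection.
    hook = HiveCliHook(
        hive_cli_conn_id="hive_cli_default",
        mapred_queue="default",
        mapred_queue_priority="NORMAL",
        mapred_job_name="airflow_example_job",
    )
    hook.run_cli("SHOW DATABASES;")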
_get_proxy_user(self)

Sets the proper proxy_user value in case the user overwrites the default.
-
static
_prepare_hiveconf
(d: Dict[Any, Any])[source]¶ This function prepares a list of hiveconf params from a dictionary of key value pairs.
- Parameters
d (dict) --
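For illustration only, a sketch of what this helper is for; the exact shape of the returned list shown in the comment is an assumption:

    from airflow.providers.apache.hive.hooks.hive import HiveCliHook

    hive_conf = {"mapreduce.job.queuename": "default", "hive.exec.dynamic.partition": "true"}
    hiveconf_params = HiveCliHook._prepare_hiveconf(hive_conf)
    # Roughly: ['-hiveconf', 'mapreduce.job.queuename=default',
    #           '-hiveconf', 'hive.exec.dynamic.partition=true']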
run_cli(self, hql: str, schema: Optional[str] = None, verbose: bool = True, hive_conf: Optional[Dict[Any, Any]] = None)

Run an hql statement using the hive cli. If hive_conf is specified it should be a dict, and its entries will be set as key/value pairs in HiveConf.

- Parameters
hive_conf (dict) -- if specified, these key/value pairs will be passed to hive as -hiveconf "key"="value". Note that they will be passed after the hive_cli_params and thus will override whatever values are specified in the database.
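A minimal sketch of passing per-call settings through hive_conf (the connection id, query and configuration key are assumptions):

    from airflow.providers.apache.hive.hooks.hive import HiveCliHook

    hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")
    hook.run_cli(
        hql="SELECT COUNT(*) FROM mydb.my_table;",
        schema="mydb",
        hive_conf={"mapreduce.job.queuename": "default"},  # sent as -hiveconf key=value
    )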
load_df(self, df: pandas.DataFrame, table: str, field_dict: Optional[Dict[Any, Any]] = None, delimiter: str = ',', encoding: str = 'utf8', pandas_kwargs: Any = None, **kwargs)

Loads a pandas DataFrame into Hive.

Hive data types will be inferred if not passed, but column names will not be sanitized.
- Parameters
df (pandas.DataFrame) -- DataFrame to load into a Hive table
table (str) -- target Hive table, use dot notation to target a specific database
field_dict (collections.OrderedDict) -- mapping from column name to hive data type. Note that it must be OrderedDict so as to keep columns' order.
delimiter (str) -- field delimiter in the file
encoding (str) -- str encoding to use when writing DataFrame to file
pandas_kwargs (dict) -- passed to DataFrame.to_csv
kwargs -- passed to self.load_file
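A minimal usage sketch (the connection id, table name and Hive types are assumptions):

    from collections import OrderedDict

    import pandas as pd
    from airflow.providers.apache.hive.hooks.hive import HiveCliHook

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")
    hook.load_df(
        df=df,
        table="mydb.my_table",  # dot notation targets a specific database
        field_dict=OrderedDict([("id", "BIGINT"), ("name", "STRING")]),
        delimiter=",",
    )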
load_file(self, filepath: str, table: str, delimiter: str = ',', field_dict: Optional[Dict[Any, Any]] = None, create: bool = True, overwrite: bool = True, partition: Optional[Dict[str, Any]] = None, recreate: bool = False, tblproperties: Optional[Dict[str, Any]] = None)

Loads a local file into Hive.

Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.

- Parameters
filepath (str) -- local filepath of the file to load
table (str) -- target Hive table, use dot notation to target a specific database
delimiter (str) -- field delimiter in the file
field_dict (collections.OrderedDict) -- A dictionary of the fields name in the file as keys and their Hive types as values. Note that it must be OrderedDict so as to keep columns' order.
create (bool) -- whether to create the table if it doesn't exist
overwrite (bool) -- whether to overwrite the data in table or partition
partition (dict) -- target partition as a dict of partition columns and values
recreate (bool) -- whether to drop and recreate the table at every execution
tblproperties (dict) -- TBLPROPERTIES of the hive table being created
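A minimal sketch of staging a CSV file into a partitioned table (the file path, table, columns and partition spec are assumptions):

    from collections import OrderedDict

    from airflow.providers.apache.hive.hooks.hive import HiveCliHook

    hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")
    hook.load_file(
        filepath="/tmp/events.csv",
        table="mydb.events_staging",
        field_dict=OrderedDict([("event_id", "BIGINT"), ("payload", "STRING")]),
        partition={"ds": "2021-01-01"},
        create=True,
        overwrite=True,
    )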
class airflow.providers.apache.hive.hooks.hive.HiveMetastoreHook(metastore_conn_id: str = default_conn_name)

Bases: airflow.hooks.base.BaseHook

Wrapper to interact with the Hive Metastore.
check_for_partition(self, schema: str, table: str, partition: str)

Checks whether a partition exists.
check_for_named_partition(self, schema: str, table: str, partition_name: str)

Checks whether a partition with a given name exists.
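A minimal sketch covering both checks above (the connection id, schema, table and partition values are assumptions; the filter-style vs. name-style arguments follow the two signatures):

    from airflow.providers.apache.hive.hooks.hive import HiveMetastoreHook

    hook = HiveMetastoreHook(metastore_conn_id="metastore_default")
    hook.check_for_partition("mydb", "events", "ds='2021-01-01'")      # partition filter expression
    hook.check_for_named_partition("mydb", "events", "ds=2021-01-01")  # exact partition name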
get_partitions(self, schema: str, table_name: str, partition_filter: Optional[str] = None)

Returns a list of all partitions in a table. Works only for tables with fewer than 32767 partitions (the Java short max value); for sub-partitioned tables, the number might easily exceed this.
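A minimal sketch (names and the filter expression are assumptions; the commented result shape of one dict per partition follows the method's return of partition specs):

    from airflow.providers.apache.hive.hooks.hive import HiveMetastoreHook

    hook = HiveMetastoreHook(metastore_conn_id="metastore_default")
    partitions = hook.get_partitions(schema="mydb", table_name="events",
                                     partition_filter="ds='2021-01-01'")
    # e.g. [{'ds': '2021-01-01'}]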
static _get_max_partition_from_part_specs(part_specs: List[Any], partition_key: Optional[str], filter_map: Optional[Dict[str, Any]])

Helper method to get the max partition (by partition_key) from part specs. key:value pairs in filter_map will be used to filter out partitions.
- Parameters
part_specs (list) -- list of partition specs.
partition_key (str) -- partition key name.
filter_map (map) -- partition_key:partition_value map used for partition filtering, e.g. {'key1': 'value1', 'key2': 'value2'}. Only partitions matching all partition_key:partition_value pairs will be considered as candidates of max partition.
- Returns
Max partition or None if part_specs is empty.
- Return type
str
max_partition(self, schema: str, table_name: str, field: Optional[str] = None, filter_map: Optional[Dict[Any, Any]] = None)

Returns the maximum value of the given partition field in a table. If only one partition key exists in the table, that key will be used as the field. filter_map should be a partition_key:partition_value map and will be used to filter out partitions.
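A minimal sketch (the connection id, table and partition keys are assumptions):

    from airflow.providers.apache.hive.hooks.hive import HiveMetastoreHook

    hook = HiveMetastoreHook(metastore_conn_id="metastore_default")
    latest_ds = hook.max_partition(
        schema="mydb",
        table_name="events",
        field="ds",                   # partition key to take the max over
        filter_map={"region": "eu"},  # optional: only consider matching partitions
    )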
class airflow.providers.apache.hive.hooks.hive.HiveServer2Hook

Bases: airflow.hooks.dbapi.DbApiHook

Wrapper around the pyhive library.

Notes:
* the default authMechanism is PLAIN; to override it you can specify it in the extra of your connection in the UI
* the default for run_set_variable_statements is true; if you are using Impala you may need to set it to false in the extra of your connection in the UI
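For illustration, a sketch of using the hook against a connection whose extra field overrides these defaults (the connection id and the extra JSON shown in the comment are assumptions):

    from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

    # Assumes a "hiveserver2_default" connection whose extra field contains, e.g.:
    #   {"authMechanism": "LDAP", "run_set_variable_statements": false}
    hook = HiveServer2Hook()
    rows = hook.get_records("SELECT 1")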
_get_results
(self, hql: Union[str, str, List[str]], schema: str = 'default', fetch_size: Optional[int] = None, hive_conf: Optional[Dict[Any, Any]] = None)[source]¶
get_results(self, hql: str, schema: str = 'default', fetch_size: Optional[int] = None, hive_conf: Optional[Dict[Any, Any]] = None)

Get results of the provided hql in target schema.

- Returns
results of hql execution, dict with data (list of results) and header
- Return type
dict
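A minimal sketch (the query and schema are assumptions; the result keys follow the Returns description above):

    from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

    hook = HiveServer2Hook()
    results = hook.get_results("SELECT * FROM mydb.my_table LIMIT 10", schema="mydb")
    print(results["header"])  # column metadata
    print(results["data"])    # list of result rows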
to_csv(self, hql: str, csv_filepath: str, schema: str = 'default', delimiter: str = ',', lineterminator: str = '\r\n', output_header: bool = True, fetch_size: int = 1000, hive_conf: Optional[Dict[Any, Any]] = None)

Execute hql in the target schema and write the results to a csv file.
- Parameters
csv_filepath (str) -- filepath of csv to write results into.
schema (str) -- target schema, default to 'default'.
delimiter (str) -- delimiter of the csv file, default to ','.
lineterminator (str) -- lineterminator of the csv file.
output_header (bool) -- header of the csv file, default to True.
fetch_size (int) -- number of result rows to write into the csv file, default to 1000.
hive_conf (dict) -- hive_conf to execute along with the hql.
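A minimal sketch (the query, schema and output path are assumptions):

    from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

    hook = HiveServer2Hook()
    hook.to_csv(
        hql="SELECT * FROM mydb.my_table",
        csv_filepath="/tmp/my_table.csv",
        schema="mydb",
        delimiter=",",
        output_header=True,   # write column names as the first row
        fetch_size=1000,
    )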
get_records(self, hql: str, schema: str = 'default', hive_conf: Optional[Dict[Any, Any]] = None)

Get a set of records from a Hive query.
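A minimal sketch (the query and schema are assumptions):

    from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

    hook = HiveServer2Hook()
    records = hook.get_records("SELECT id, name FROM my_table LIMIT 5", schema="mydb")
    for row in records:
        print(row)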