Module Contents

airflow.operators.hive_to_druid.LOAD_CHECK_INTERVAL = 5[source]
airflow.operators.hive_to_druid.DEFAULT_TARGET_PARTITION_SIZE = 5000000[source]
class airflow.operators.hive_to_druid.HiveToDruidTransfer(sql, druid_datasource, ts_dim, metric_spec=None, hive_cli_conn_id='hive_cli_default', druid_ingest_conn_id='druid_ingest_default', metastore_conn_id='metastore_default', hadoop_dependency_coordinates=None, intervals=None, num_shards=-1, target_partition_size=-1, query_granularity='NONE', segment_granularity='DAY', hive_tblproperties=None, job_properties=None, *args, **kwargs)[source]

Bases: airflow.models.BaseOperator

Moves data from Hive to Druid. Note that for now the data is loaded into memory before being pushed to Druid, so this operator should only be used for small amounts of data.

  • sql (str) – SQL query to execute against the Hive database. (templated)

  • druid_datasource (str) – the Druid datasource you want to ingest into

  • ts_dim (str) – the timestamp dimension

  • metric_spec (list) – the metrics you want to define for your data

  • hive_cli_conn_id (str) – the Hive connection id

  • druid_ingest_conn_id (str) – the Druid ingest connection id

  • metastore_conn_id (str) – the metastore connection id

  • hadoop_dependency_coordinates (list[str]) – list of coordinates to squeeze into the ingest JSON

  • intervals (list) – list of time intervals that define the segments; this is passed as-is to the JSON object. (templated)

  • hive_tblproperties (dict) – additional properties set via TBLPROPERTIES on the Hive staging table

  • job_properties (dict) – additional properties for job

template_fields = ['sql', 'intervals'][source]
template_ext = ['.sql'][source]
execute(self, context)[source]
construct_ingest_query(self, static_path, columns)[source]

Builds an ingest query for an HDFS TSV load.

  • static_path (str) – The path on HDFS where the data is located

  • columns (list) – List of all the columns that are available
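To make the shape of the result concrete, here is a simplified, self-contained sketch of the kind of Druid `index_hadoop` ingestion spec that `construct_ingest_query` assembles for a TSV load. The helper name `build_ingest_spec` and all field values are illustrative assumptions; the operator's actual output includes additional tuning and partitioning settings not shown here.

```python
# Illustrative sketch (not the operator's exact output): builds a dict
# resembling a Druid "index_hadoop" ingestion spec for a TSV file on HDFS.

def build_ingest_spec(datasource, ts_dim, columns, static_path,
                      metric_spec=None, intervals=None,
                      query_granularity="NONE", segment_granularity="DAY"):
    """Return a dict resembling a Druid Hadoop ingestion spec for a TSV load."""
    metric_spec = metric_spec or [{"name": "count", "type": "count"}]
    # Dimensions are all columns except the timestamp and any metric columns.
    metric_fields = {m.get("fieldName") for m in metric_spec}
    dimensions = [c for c in columns if c != ts_dim and c not in metric_fields]
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "tsv",
                        "columns": columns,
                        "timestampSpec": {"column": ts_dim, "format": "auto"},
                        "dimensionsSpec": {"dimensions": dimensions},
                    },
                },
                "metricsSpec": metric_spec,
                "granularitySpec": {
                    "type": "uniform",
                    "queryGranularity": query_granularity,
                    "segmentGranularity": segment_granularity,
                    "intervals": intervals,
                },
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": static_path},
            },
        },
    }

spec = build_ingest_spec(
    datasource="events",
    ts_dim="ts",
    columns=["ts", "country", "clicks"],
    static_path="/tmp/druid_staging",
    metric_spec=[{"name": "clicks", "type": "longSum", "fieldName": "clicks"}],
)
```

Note how the dimension list is derived by subtraction: every column that is neither the timestamp nor a metric field becomes a dimension, which is why `columns` must list everything present in the staged TSV.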