airflow.operators.hive_stats_operator

Module Contents

class airflow.operators.hive_stats_operator.HiveStatsCollectionOperator(table, partition, extra_exprs=None, col_blacklist=None, assignment_func=None, metastore_conn_id='metastore_default', presto_conn_id='presto_default', mysql_conn_id='airflow_db', *args, **kwargs)[source]

Bases: airflow.models.BaseOperator

Gathers partition statistics using a dynamically generated Presto query, inserts the stats into a MySql table with this format. Stats overwrite themselves if you rerun the same date/partition.

CREATE TABLE hive_stats (
    ds VARCHAR(16),
    table_name VARCHAR(500),
    metric VARCHAR(200),
    value BIGINT
);
Parameters
  • table (str) – the source table, in the format database.table_name. (templated)

  • partition (dict of {col:value}) – the source partition. (templated)

  • extra_exprs (dict) – dict of expression to run against the table where keys are metric names and values are Presto compatible expressions

  • col_blacklist (list) – list of columns to blacklist, consider blacklisting blobs, large json columns, …

  • assignment_func (function) – a function that receives a column name and a type, and returns a dict of metric names and an Presto expressions. If None is returned, the global defaults are applied. If an empty dictionary is returned, no stats are computed for that column.

template_fields = ['table', 'partition', 'ds', 'dttm'][source]
ui_color = #aff7a6[source]
get_default_exprs(self, col, col_type)[source]
execute(self, context=None)[source]

Was this entry helpful?