Module Contents




class airflow.providers.apache.hive.operators.hive_stats.HiveStatsCollectionOperator(*, table, partition, extra_exprs=None, excluded_columns=None, assignment_func=None, metastore_conn_id='metastore_default', presto_conn_id='presto_default', mysql_conn_id='airflow_db', **kwargs)

Bases: airflow.models.BaseOperator

Gathers partition statistics using a dynamically generated Presto query and inserts the stats into a MySQL table with the format shown below. Stats are overwritten if you rerun the same date/partition.

CREATE TABLE hive_stats (
    ds VARCHAR(16),
    table_name VARCHAR(500),
    metric VARCHAR(200),
    value BIGINT
);

Parameters

  • metastore_conn_id (str) – Reference to the Hive Metastore connection id.

  • table (str) – the source table, in the format database.table_name. (templated)

  • partition (Any) – the source partition. (templated)

  • extra_exprs (dict[str, Any] | None) – dict of expressions to run against the table, where keys are metric names and values are Presto-compatible expressions

  • excluded_columns (list[str] | None) – list of columns to exclude; consider excluding blobs, large JSON columns, …

  • assignment_func (Callable[[str, str], dict[Any, Any] | None] | None) – a function that receives a column name and a type, and returns a dict of metric names and Presto expressions. If None is returned, the global defaults are applied. If an empty dictionary is returned, no stats are computed for that column.
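A minimal sketch of how these parameters might be wired together. The task id, table name, column names, and the `skip_blobs` helper are hypothetical. The `(column, metric)` tuple keys and the dict form of `partition` are assumptions about the expected shapes, since the descriptions above only say "metric names" and `Any`.

```python
from typing import Any, Optional


def skip_blobs(col: str, col_type: str) -> Optional[dict[Any, Any]]:
    """Hypothetical assignment_func: custom stats for selected columns."""
    if col_type == "binary":
        # Empty dict: no stats are computed for this column.
        return {}
    if col == "user_id":
        # Custom metric; keys are assumed to be (column, metric) tuples.
        return {(col, "approx_distinct"): f"APPROX_DISTINCT({col})"}
    # None: the operator falls back to its global defaults.
    return None


def build_stats_task():
    """Build the operator; needs Airflow with the Hive provider installed."""
    # Imported lazily so the helper above can be used without Airflow.
    from airflow.providers.apache.hive.operators.hive_stats import (
        HiveStatsCollectionOperator,
    )

    return HiveStatsCollectionOperator(
        task_id="collect_hive_stats",  # hypothetical task id
        table="analytics.events",  # hypothetical database.table_name
        partition={"ds": "{{ ds }}"},  # assumed: dict of partition key to value
        excluded_columns=["payload_json"],  # hypothetical large JSON column
        assignment_func=skip_blobs,
    )
```

The three return shapes of `skip_blobs` correspond to the three cases described for `assignment_func`: custom expressions, the global defaults (`None`), and no stats at all (empty dict).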

template_fields: Sequence[str] = ('table', 'partition', 'ds', 'dttm')
ui_color = '#aff7a6'
get_default_exprs(col, col_type)

Get the default metric expressions for column col of type col_type.
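The fallback rule described for `assignment_func` (custom result, `None` for defaults, empty dict for no stats) can be sketched as follows. Both `resolve_exprs` and `toy_defaults` are hypothetical names introduced for illustration; `toy_defaults` only stands in for `get_default_exprs` and does not reproduce the provider's real defaults.

```python
from typing import Any, Callable, Optional

ExprDict = dict[Any, Any]
ExprFunc = Callable[[str, str], Optional[ExprDict]]


def toy_defaults(col: str, col_type: str) -> ExprDict:
    """Stand-in for get_default_exprs; NOT the provider's actual defaults."""
    return {(col, "non_null"): f"COUNT({col})"}


def resolve_exprs(
    col: str,
    col_type: str,
    assignment_func: Optional[ExprFunc],
    default_func: Callable[[str, str], ExprDict] = toy_defaults,
) -> ExprDict:
    """Pick the expressions for one column using the documented fallback rule."""
    if assignment_func is None:
        # No custom function supplied: always use the defaults.
        return default_func(col, col_type)
    exprs = assignment_func(col, col_type)
    if exprs is None:
        # The custom function declined: use the defaults for this column.
        return default_func(col, col_type)
    # Includes the empty dict, which means "compute nothing for this column".
    return exprs


def no_stats_for_strings(col: str, col_type: str) -> Optional[ExprDict]:
    """Hypothetical assignment_func that disables stats on string columns."""
    return {} if col_type == "string" else None
```

Note that the check is `is None` rather than a truthiness test, so an empty dict is kept as-is instead of being replaced by the defaults.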


execute(context)

This is the main method to override when creating an operator. Context is the same dictionary used when rendering Jinja templates.

Refer to get_template_context for more context.
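The overwrite-on-rerun behaviour described for this operator can be illustrated with a delete-then-insert pattern. This is a sketch under stated assumptions, not the operator's implementation: it uses the standard-library `sqlite3` module as a stand-in for MySQL, the `write_stats` helper is hypothetical, and it assumes that a rerun is identified by the `ds` and `table_name` columns of the `hive_stats` table.

```python
import sqlite3


def write_stats(conn, ds, table_name, stats):
    """Replace all stats rows for (ds, table_name) with the given metrics."""
    with conn:  # run the delete and the inserts in a single transaction
        conn.execute(
            "DELETE FROM hive_stats WHERE ds = ? AND table_name = ?",
            (ds, table_name),
        )
        conn.executemany(
            "INSERT INTO hive_stats (ds, table_name, metric, value) "
            "VALUES (?, ?, ?, ?)",
            [(ds, table_name, metric, value) for metric, value in stats.items()],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hive_stats ("
    "ds VARCHAR(16), table_name VARCHAR(500), metric VARCHAR(200), value BIGINT)"
)

# First run, then a rerun for the same date and table with updated numbers.
write_stats(conn, "2024-01-01", "analytics.events", {"count": 10})
write_stats(conn, "2024-01-01", "analytics.events", {"count": 12})

rows = conn.execute("SELECT ds, metric, value FROM hive_stats").fetchall()
```

After the second call, `rows` holds a single row for that date and table, so reruns do not accumulate duplicate metrics.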
