airflow.providers.apache.hive.transfers.s3_to_hive

This module contains an operator to move data from an S3 bucket to Hive.

Module Contents

Classes

S3ToHiveOperator

Moves data from S3 to Hive.

class airflow.providers.apache.hive.transfers.s3_to_hive.S3ToHiveOperator(*, s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, hive_auth=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Moves data from S3 to Hive.

The operator downloads a file from S3, stores the file locally before loading it into a Hive table.

If the create or recreate arguments are set to True, a CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor’s metadata from.

Note that the table generated in Hive uses STORED AS textfile which isn’t the most efficient serialization format. If a large amount of data is loaded and/or if the tables gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.

Parameters
  • s3_key (str) – The key to be retrieved from S3. (templated)

  • field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive types as values

  • hive_table (str) – target Hive table, use dot notation to target a specific database. (templated)

  • delimiter (str) – field delimiter in the file

  • create (bool) – whether to create the table if it doesn’t exist

  • recreate (bool) – whether to drop and recreate the table at every execution

  • partition (dict | None) – target partition as a dict of partition columns and values. (templated)

  • headers (bool) – whether the file contains column names on the first line

  • check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict

  • wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern

  • aws_conn_id (str | None) – source s3 connection

  • verify (bool | str | None) –

    Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:

    • False: do not validate SSL certificates. SSL will still be used

      (unless use_ssl is False), but SSL certificates will not be verified.

    • path/to/cert/bundle.pem: A filename of the CA cert bundle to uses.

      You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

  • hive_cli_conn_id (str) – Reference to the Hive CLI connection id.

  • input_compressed (bool) – Boolean to determine if file decompression is required to process headers

  • tblproperties (dict | None) – TBLPROPERTIES of the hive table being created

  • select_expression (str | None) – S3 Select expression

template_fields: collections.abc.Sequence[str] = ('s3_key', 'partition', 'hive_table')[source]
template_ext: collections.abc.Sequence[str] = ()[source]
ui_color = '#a0e08c'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

Was this entry helpful?