airflow.operators.s3_to_hive_operator

Module Contents

class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, *args, **kwargs)

Bases: airflow.models.BaseOperator

Moves data from S3 to Hive. The operator downloads a file from S3 and stores it locally before loading it into a Hive table. If either create or recreate is set to True, the corresponding CREATE TABLE and DROP TABLE statements are generated. Hive data types are taken from the values of field_dict.
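As a rough illustration of the statements mentioned above, the following sketch builds DROP TABLE and CREATE TABLE strings from a field_dict. The helper name build_ddl, the table name, and the exact statement layout are assumptions made for this example; the statements Airflow actually generates may differ in detail.

```python
from collections import OrderedDict


def build_ddl(hive_table, field_dict, delimiter=",", create=True,
              recreate=False, partition=None):
    """Return the HQL statements a create/recreate load might need.

    Hypothetical helper for illustration only; it is not part of Airflow.
    """
    statements = []
    if recreate:
        # recreate=True drops the table before creating it again.
        statements.append("DROP TABLE IF EXISTS {};".format(hive_table))
    if create or recreate:
        columns = ",\n".join(
            "    {} {}".format(name, hive_type)
            for name, hive_type in field_dict.items()
        )
        ddl = "CREATE TABLE IF NOT EXISTS {} (\n{})\n".format(
            hive_table, columns)
        if partition:
            # Partition columns are declared as STRING in this sketch.
            parts = ", ".join("{} STRING".format(col) for col in partition)
            ddl += "PARTITIONED BY ({})\n".format(parts)
        ddl += "ROW FORMAT DELIMITED\n"
        ddl += "FIELDS TERMINATED BY '{}'\n".format(delimiter)
        ddl += "STORED AS textfile;"
        statements.append(ddl)
    return statements


# Hypothetical table and columns, listed in file order.
fields = OrderedDict([("id", "BIGINT"), ("name", "STRING")])
for statement in build_ddl("staging.users", fields, recreate=True,
                           partition={"ds": "2019-01-01"}):
    print(statement)
```

With create=False and recreate=False the helper returns an empty list, mirroring the case where the target table is expected to exist already.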

Note that the table generated in Hive uses STORED AS textfile, which isn’t the most efficient serialization format. If a large amount of data is loaded and/or the table is queried heavily, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
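The staging pattern described above might look like the following sketch. The DAG id, bucket path, table names, column list, and schedule are all hypothetical. The operator is imported from the module documented here, and the HiveOperator import path (airflow.operators.hive_operator) is assumed to be the one from the same Airflow release series. The Airflow imports are kept inside the factory function so the module can be read and imported without Airflow installed.

```python
from collections import OrderedDict
from datetime import datetime

# Column names and Hive types of the (hypothetical) CSV file, in file order.
FIELD_DICT = OrderedDict([
    ("user_id", "BIGINT"),
    ("event", "STRING"),
    ("amount", "DOUBLE"),
])

# HQL that copies the staged text data into the final table (hypothetical names).
LOAD_FINAL_HQL = """
INSERT OVERWRITE TABLE analytics.events PARTITION (ds='{{ ds }}')
SELECT user_id, event, amount
FROM staging.events_text
WHERE ds = '{{ ds }}';
"""


def build_dag():
    """Build a DAG that stages an S3 file in Hive, then loads the final table."""
    # Imported lazily so this module can be imported without Airflow.
    from airflow import DAG
    from airflow.operators.hive_operator import HiveOperator
    from airflow.operators.s3_to_hive_operator import S3ToHiveTransfer

    dag = DAG(
        dag_id="s3_to_hive_staging_example",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    stage = S3ToHiveTransfer(
        task_id="stage_events",
        s3_key="s3://example-bucket/events/{{ ds }}/events.csv",
        field_dict=FIELD_DICT,
        hive_table="staging.events_text",
        delimiter=",",
        recreate=True,          # drop and recreate the staging table each run
        partition={"ds": "{{ ds }}"},
        headers=True,           # the first line holds column names
        check_headers=True,     # compare them with the FIELD_DICT keys
        dag=dag,
    )

    load_final = HiveOperator(
        task_id="load_final_table",
        hql=LOAD_FINAL_HQL,
        hive_cli_conn_id="hive_cli_default",
        dag=dag,
    )

    # Stage first, then load the final table.
    stage >> load_final
    return dag
```

In a DAG file, assign the result to a module-level name (for example dag = build_dag()) so the scheduler can discover it. The final table analytics.events is assumed to exist already and to use a more efficient storage format than textfile, such as ORC.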

Parameters
  • s3_key (str) – The key to be retrieved from S3. (templated)

  • field_dict (dict) – A dictionary with the field names in the file as keys and their Hive types as values

  • hive_table (str) – target Hive table, use dot notation to target a specific database. (templated)

  • create (bool) – whether to create the table if it doesn’t exist

  • recreate (bool) – whether to drop and recreate the table at every execution

  • partition (dict) – target partition as a dict of partition columns and values. (templated)

  • headers (bool) – whether the file contains column names on the first line

  • check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict

  • wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern

  • delimiter (str) – field delimiter in the file

  • aws_conn_id (str) – source S3 connection

  • verify (bool or str) –

    Whether or not to verify SSL certificates for the S3 connection. By default, SSL certificates are verified. You can provide the following values:

    • False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.

    • path/to/cert/bundle.pem: the filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

  • hive_cli_conn_id (str) – destination Hive connection

  • input_compressed (bool) – Boolean to determine if file decompression is required to process headers

  • tblproperties (dict) – TBLPROPERTIES of the hive table being created

  • select_expression (str) – S3 Select expression
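
To make the wildcard_match and check_headers parameters more concrete, the sketch below mimics both behaviours in plain Python. The helper names matching_keys and headers_match are hypothetical, the standard-library fnmatch module is used as a stand-in for Unix-style wildcard matching, and the case-insensitive header comparison is an assumption; the operator's own logic may differ.

```python
import fnmatch
from collections import OrderedDict


def matching_keys(pattern, keys):
    """Return the keys that match a Unix-style wildcard pattern."""
    return [key for key in keys if fnmatch.fnmatchcase(key, pattern)]


def headers_match(header_row, field_dict, delimiter=","):
    """Check a header line against the keys of field_dict, in order."""
    header_list = [name.strip() for name in header_row.split(delimiter)]
    field_names = list(field_dict)
    if len(header_list) != len(field_names):
        return False
    # Compare position by position, ignoring case (an assumption).
    return all(h.lower() == f.lower()
               for h, f in zip(header_list, field_names))


# Hypothetical S3 keys; only the .csv files match the pattern.
keys = [
    "logs/2019-01-01/part-0.csv",
    "logs/2019-01-01/part-1.csv",
    "logs/2019-01-01/_SUCCESS",
]
print(matching_keys("logs/2019-01-01/*.csv", keys))

field_dict = OrderedDict([("id", "BIGINT"), ("name", "STRING")])
print(headers_match("ID,Name", field_dict))    # True
print(headers_match("id,email", field_dict))   # False
```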

template_fields = ['s3_key', 'partition', 'hive_table']
template_ext = []
ui_color = '#a0e08c'
execute(self, context)
_get_top_row_as_list(self, file_name)
_match_headers(self, header_list)
static _delete_top_row_and_compress(input_file_name, output_file_ext, dest_dir)
