airflow.providers.apache.hive.transfers.s3_to_hive¶

This module contains an operator to move data from an S3 bucket to Hive.

Classes¶

S3ToHiveOperator

Moves data from S3 to Hive.

Functions¶

uncompress_file(input_file_name, file_extension, dest_dir)

Uncompress gz and bz2 files.

Module Contents¶

class airflow.providers.apache.hive.transfers.s3_to_hive.S3ToHiveOperator(*, s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, hive_auth=None, **kwargs)[source]¶

Bases: airflow.models.BaseOperator

Moves data from S3 to Hive.

The operator downloads a file from S3, stores the file locally before loading it into a Hive table.

If the create or recreate arguments are set to True, a CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor’s metadata from.

Note that the table generated in Hive uses STORED AS textfile which isn’t the most efficient serialization format. If a large amount of data is loaded and/or if the tables gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.

Parameters:

s3_key (str) – The key to be retrieved from S3. (templated)
field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive types as values
hive_table (str) – target Hive table, use dot notation to target a specific database. (templated)
delimiter (str) – field delimiter in the file
create (bool) – whether to create the table if it doesn’t exist
recreate (bool) – whether to drop and recreate the table at every execution
partition (dict | None) – target partition as a dict of partition columns and values. (templated)
headers (bool) – whether the file contains column names on the first line
check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict
wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern
aws_conn_id (str | None) – source s3 connection
verify (bool | str | None) –
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
- False: do not validate SSL certificates. SSL will still be used
  (unless use_ssl is False), but SSL certificates will not be verified.
- path/to/cert/bundle.pem: A filename of the CA cert bundle to uses.
  You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
hive_cli_conn_id (str) – Reference to the Hive CLI connection id.
input_compressed (bool) – Boolean to determine if file decompression is required to process headers
tblproperties (dict | None) – TBLPROPERTIES of the hive table being created
select_expression (str | None) – S3 Select expression

template_fields: collections.abc.Sequence[str] = ('s3_key', 'partition', 'hive_table')[source]¶

template_ext: collections.abc.Sequence[str] = ()[source]¶

ui_color = '#a0e08c'[source]¶

s3_key[source]¶

field_dict[source]¶

hive_table[source]¶

delimiter = ','[source]¶

create = True[source]¶

recreate = False[source]¶

partition = None[source]¶

headers = False[source]¶

check_headers = False[source]¶

wildcard_match = False[source]¶

hive_cli_conn_id = 'hive_cli_default'[source]¶

aws_conn_id = 'aws_default'[source]¶

verify = None[source]¶

input_compressed = False[source]¶

tblproperties = None[source]¶

select_expression = None[source]¶

hive_auth = None[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

airflow.providers.apache.hive.transfers.s3_to_hive.uncompress_file(input_file_name, file_extension, dest_dir)[source]¶: Uncompress gz and bz2 files.