airflow.contrib.operators.qubole_operator¶
Qubole operator
Module Contents¶
- class airflow.contrib.operators.qubole_operator.QDSLink[source]¶
Bases: airflow.models.baseoperator.BaseOperatorLink
Link to QDS
- class airflow.contrib.operators.qubole_operator.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)[source]¶
Bases: airflow.models.baseoperator.BaseOperator
Execute tasks (commands) on QDS (https://qubole.com).
- Parameters
qubole_conn_id (str) – Connection ID which holds the QDS auth_token
- kwargs:
- command_type
type of command to be executed, e.g. hivecmd, shellcmd, hadoopcmd
- tags
array of tags to be assigned to the command
- cluster_label
cluster label on which the command will be executed
- name
name to be given to the command
- notify
whether to send an email on command completion (default is False)
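A minimal usage sketch combining the common kwargs above with a hivecmd; the DAG id, cluster label, and tags are illustrative assumptions, not values required by the operator:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        # Run an inline Hive query on QDS using the common kwargs documented above.
        hive_task = QuboleOperator(
            task_id='hive_show_tables',
            command_type='hivecmd',
            query='show tables',
            cluster_label='default',          # hypothetical cluster label
            tags=['airflow', 'example'],      # tags attached to the command
            notify=True,                      # send an email when the command completes
            qubole_conn_id='qubole_default',  # connection holding the QDS auth_token
        )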
Arguments specific to command types
- hivecmd:
- query
inline query statement
- script_location
s3 location containing query statement
- sample_size
size of sample in bytes on which to run query
- macros
macro values to be used in the query
- hive-version
Specifies the Hive version to be used, e.g. 0.13, 1.2, etc.
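A sketch of a hivecmd whose statement lives in S3 rather than inline; the bucket path, sample size, and cluster label are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_hive_s3_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        hive_s3_task = QuboleOperator(
            task_id='hive_from_s3',
            command_type='hivecmd',
            script_location='s3://my-bucket/scripts/daily_report.hql',  # hypothetical S3 location
            sample_size=1048576,      # run the query on a ~1 MB sample
            cluster_label='default',  # hypothetical cluster label
        )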
- prestocmd:
- query
inline query statement
- script_location
s3 location containing query statement
- macros
macro values to be used in the query
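A minimal prestocmd sketch with an inline query; the table name and cluster label are assumptions:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_presto_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        presto_task = QuboleOperator(
            task_id='presto_count',
            command_type='prestocmd',
            query='select count(*) from default.web_logs',  # hypothetical table
            cluster_label='presto',                          # hypothetical cluster label
        )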
- hadoopcmd:
- sub_command
must be one of ["jar", "s3distcp", "streaming"], followed by one or more args
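A hadoopcmd sketch using the s3distcp sub_command; both bucket paths and the cluster label are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_hadoop_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        distcp_task = QuboleOperator(
            task_id='copy_raw_logs',
            command_type='hadoopcmd',
            # sub_command starts with one of "jar", "s3distcp", "streaming", followed by its args.
            sub_command='s3distcp --src s3://my-bucket/raw/ --dest s3://my-bucket/archive/',
            cluster_label='hadoop',  # hypothetical cluster label
        )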
- shellcmd:
- script
inline command with args
- script_location
s3 location containing query statement
- files
list of files in an S3 bucket, in file1,file2 format. These files will be copied into the working directory where the qubole command is being executed.
- archives
list of archives in an S3 bucket, in archive1,archive2 format. These will be unarchived into the working directory where the qubole command is being executed.
- parameters
any extra args which need to be passed to script (only when script_location is supplied)
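A shellcmd sketch that runs a script stored in S3 and stages supporting files into the command's working directory; all S3 paths and the parameter value are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_shell_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        shell_task = QuboleOperator(
            task_id='shell_from_s3',
            command_type='shellcmd',
            script_location='s3://my-bucket/scripts/cleanup.sh',  # hypothetical script
            files='s3://my-bucket/scripts/helper1.py,s3://my-bucket/scripts/helper2.py',  # copied to the working dir
            parameters='--dry-run',   # extra args, allowed only because script_location is supplied
            cluster_label='default',  # hypothetical cluster label
        )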
- pigcmd:
- script
inline query statement (Pig Latin statements)
- script_location
s3 location containing pig query
- parameters
any extra args which need to be passed to script (only when script_location is supplied)
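A pigcmd sketch with inline Pig Latin statements; the S3 input path and cluster label are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_pig_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        pig_task = QuboleOperator(
            task_id='pig_dump_lines',
            command_type='pigcmd',
            script="lines = LOAD 's3://my-bucket/data/input.txt' AS (line:chararray); DUMP lines;",
            cluster_label='default',  # hypothetical cluster label
        )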
- sparkcmd:
- program
the complete Spark Program in Scala, R, or Python
- cmdline
spark-submit command line; all required information must be specified in the cmdline itself.
- sql
inline sql query
- script_location
s3 location containing query statement
- language
language of the program, Scala, R, or Python
- app_id
ID of a Spark job server app
- arguments
spark-submit command line arguments
- user_program_arguments
arguments that the user program takes in
- macros
macro values to be used in the query
- note_id
ID of the notebook to run
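A sparkcmd sketch that submits an inline PySpark program; the program, spark-submit arguments, and cluster label are illustrative assumptions:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    PYSPARK_PROGRAM = (
        'from pyspark.sql import SparkSession\n'
        'spark = SparkSession.builder.appName("airflow-qubole-example").getOrCreate()\n'
        'print(spark.range(100).count())\n'
    )

    with DAG('qubole_spark_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        spark_task = QuboleOperator(
            task_id='spark_row_count',
            command_type='sparkcmd',
            program=PYSPARK_PROGRAM,
            language='python',
            arguments='--conf spark.executor.memory=2g',  # hypothetical spark-submit arguments
            cluster_label='spark',                        # hypothetical cluster label
        )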
- dbtapquerycmd:
- db_tap_id
data store ID of the target database in Qubole
- query
inline query statement
- macros
macro values to be used in the query
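A dbtapquerycmd sketch that queries a database registered as a Qubole data store; the db_tap_id and table name are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_dbtap_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        dbtap_task = QuboleOperator(
            task_id='dbtap_user_count',
            command_type='dbtapquerycmd',
            db_tap_id='2',                       # hypothetical data store ID
            query='select count(*) from users',  # hypothetical table
        )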
- dbexportcmd:
- mode
Can be 1 for Hive export or 2 for HDFS/S3 export
- schema
DB schema name; if not specified, the database's default schema is assumed
- hive_table
Name of the hive table
- partition_spec
partition specification for Hive table.
- dbtap_id
data store ID of the target database in Qubole
- db_table
name of the db table
- db_update_mode
allowinsert or updateonly
- db_update_keys
columns used to determine the uniqueness of rows
- export_dir
HDFS/S3 location from which data will be exported.
- fields_terminated_by
hex of the char used as column separator in the dataset
- use_customer_cluster
whether to run the command on the customer cluster
- customer_cluster_label
the label of the cluster to run the command on
- additional_options
Additional Sqoop options which are needed; enclose the options in double or single quotes, e.g. '--map-column-hive id=int,data=string'
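A dbexportcmd sketch for a mode 1 (Hive) export into an external database; every ID, table, and partition value here is a placeholder assumption:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_dbexport_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        export_task = QuboleOperator(
            task_id='export_daily_summary',
            command_type='dbexportcmd',
            mode=1,                        # Hive export
            hive_table='daily_summary',    # hypothetical Hive table
            partition_spec='dt=20210101',  # hypothetical partition
            dbtap_id='2',                  # hypothetical data store ID
            db_table='daily_summary',      # hypothetical target table
            db_update_mode='updateonly',
            db_update_keys='id',           # column identifying unique rows
        )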
- dbimportcmd:
- mode
1 (simple) or 2 (advanced)
- hive_table
Name of the hive table
- schema
DB schema name; if not specified, the database's default schema is assumed
- hive_serde
Output format of the Hive Table
- dbtap_id
data store ID of the target database in Qubole
- db_table
name of the db table
- where_clause
where clause, if any
- parallelism
number of parallel db connections to use for extracting data
- extract_query
SQL query to extract data from db. $CONDITIONS must be part of the where clause.
- boundary_query
Query used to get the range of row IDs to be extracted
- split_column
Column used as row ID to split data into ranges (mode 2)
- use_customer_cluster
whether to run the command on the customer cluster
- customer_cluster_label
the label of the cluster to run the command on
- additional_options
Additional Sqoop options which are needed; enclose the options in double or single quotes
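A dbimportcmd sketch for a mode 1 (simple) import from an external database into Hive; the IDs, table names, and where clause are placeholder assumptions:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_dbimport_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        import_task = QuboleOperator(
            task_id='import_users',
            command_type='dbimportcmd',
            mode=1,                                     # simple mode
            dbtap_id='2',                               # hypothetical data store ID
            db_table='users',                           # hypothetical source table
            hive_table='users_snapshot',                # hypothetical target Hive table
            where_clause="created_at >= '2021-01-01'",  # optional filter
            parallelism=2,                              # parallel DB connections
        )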
- template_fields :Iterable[str] = ['query', 'script_location', 'sub_command', 'script', 'files', 'archives', 'program', 'cmdline', 'sql', 'where_clause', 'tags', 'extract_query', 'boundary_query', 'macros', 'name', 'parameters', 'dbtap_id', 'hive_table', 'db_table', 'split_column', 'note_id', 'db_update_keys', 'export_dir', 'partition_spec', 'qubole_conn_id', 'arguments', 'user_program_arguments', 'cluster_label'][source]¶
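Because fields such as query, script_location, and parameters appear in template_fields, Airflow renders Jinja templates in them at runtime. A short sketch assuming a hypothetical table, using the standard {{ ds }} execution-date macro:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG('qubole_templated_example', start_date=datetime(2021, 1, 1), schedule_interval='@daily') as dag:
        daily_count_task = QuboleOperator(
            task_id='daily_partition_count',
            command_type='hivecmd',
            # 'query' is a templated field, so {{ ds }} is replaced with the execution date.
            query="select count(*) from default.web_logs where dt = '{{ ds }}'",
            cluster_label='default',  # hypothetical cluster label
        )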