airflow.contrib.operators.qubole_operator¶
Qubole operator
Module Contents¶
class airflow.contrib.operators.qubole_operator.QDSLink[source]¶

Bases: airflow.models.baseoperator.BaseOperatorLink

Link to QDS.
class airflow.contrib.operators.qubole_operator.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)[source]¶

Bases: airflow.models.baseoperator.BaseOperator

Execute tasks (commands) on QDS (https://qubole.com).

Parameters
- qubole_conn_id (str) – Connection id which holds the QDS auth_token
kwargs (a basic usage sketch follows this list):
- command_type: type of command to be executed, e.g. hivecmd, shellcmd, hadoopcmd
- tags: array of tags to be assigned to the command
- cluster_label: cluster label on which the command will be executed
- name: name to be given to the command
- notify: whether to send an email on command completion (default is False)
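A minimal usage sketch tying the general kwargs together. The DAG, cluster label, and query below are illustrative placeholders rather than part of this module; qubole_conn_id='qubole_default' matches the operator's default connection id.

    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator
    from airflow.utils.dates import days_ago

    # Hypothetical DAG reused by the sketches that follow each command type below.
    dag = DAG(
        dag_id='qubole_example',
        start_date=days_ago(1),
        schedule_interval=None,
    )

    # Extra kwargs are forwarded to the QDS command as described above.
    hive_show_tables = QuboleOperator(
        task_id='hive_show_tables',
        command_type='hivecmd',           # type of command to be executed
        query='show tables',              # inline query statement
        cluster_label='default',          # placeholder cluster label
        tags='airflow_example_run',       # tag assigned to the command
        notify=False,                     # no email on completion
        qubole_conn_id='qubole_default',  # connection holding the QDS auth_token
        dag=dag,
    )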
Arguments specific to command types (a short usage sketch follows each group below)

hivecmd:
- query: inline query statement
- script_location: S3 location containing the query statement
- sample_size: size of the sample in bytes on which to run the query
- macros: macro values used in the query
- hive-version: specifies the Hive version to be used, e.g. 0.13, 1.2, etc.
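A sketch of the script_location and macros variant of hivecmd, assuming the imports and dag object from the basic sketch above; the S3 path and macro JSON are placeholders.

    hive_s3_script = QuboleOperator(
        task_id='hive_s3_script',
        command_type='hivecmd',
        script_location='s3://example-bucket/scripts/sample.sql',  # placeholder S3 location
        macros='[{"date": "{{ ds }}"}]',  # placeholder macro values used in the query
        cluster_label='default',          # placeholder cluster label
        dag=dag,
    )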
 
prestocmd:
- query: inline query statement
- script_location: S3 location containing the query statement
- macros: macro values used in the query
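A prestocmd sketch under the same assumptions (imports and dag from the basic sketch above; the cluster label is a placeholder).

    presto_show_tables = QuboleOperator(
        task_id='presto_show_tables',
        command_type='prestocmd',
        query='show tables',              # inline query statement
        cluster_label='presto-cluster',   # placeholder cluster label
        dag=dag,
    )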
 
hadoopcmd:
- sub_command: must be one of ["jar", "s3distcp", "streaming"], followed by one or more args
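A hadoopcmd sketch; the jar and input/output S3 paths are placeholders, and dag comes from the basic sketch above.

    hadoop_jar_cmd = QuboleOperator(
        task_id='hadoop_jar_cmd',
        command_type='hadoopcmd',
        # "jar" sub_command followed by its own arguments (all paths are placeholders)
        sub_command='jar s3://example-bucket/jars/wordcount.jar '
                    's3://example-bucket/input s3://example-bucket/output',
        cluster_label='default',
        dag=dag,
    )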
 
shellcmd:
- script: inline command with args
- script_location: S3 location containing the script
- files: list of files in an S3 bucket, in file1,file2 format. These files will be copied into the working directory where the Qubole command is being executed.
- archives: list of archives in an S3 bucket, in archive1,archive2 format. These will be unarchived into the working directory where the Qubole command is being executed.
- parameters: any extra args which need to be passed to the script (only when script_location is supplied)
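A shellcmd sketch; the script, files, and archives locations are placeholders, and dag comes from the basic sketch above.

    shell_cmd = QuboleOperator(
        task_id='shell_cmd',
        command_type='shellcmd',
        script_location='s3://example-bucket/scripts/job.sh',  # placeholder script location
        parameters='param1 param2',       # extra args passed to the script
        files='s3://example-bucket/aux/file1.txt,s3://example-bucket/aux/file2.txt',
        archives='s3://example-bucket/aux/deps.tar.gz',
        dag=dag,
    )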
 
pigcmd:
- script: inline query statement (latin_statements)
- script_location: S3 location containing the Pig query
- parameters: any extra args which need to be passed to the script (only when script_location is supplied)
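A pigcmd sketch under the same assumptions (placeholder S3 path, dag from the basic sketch above).

    pig_cmd = QuboleOperator(
        task_id='pig_cmd',
        command_type='pigcmd',
        script_location='s3://example-bucket/scripts/sample.pig',  # placeholder Pig script
        parameters='key1=value1 key2=value2',  # only used with script_location
        dag=dag,
    )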
 
sparkcmd:
- program: the complete Spark program in Scala, R, or Python
- cmdline: spark-submit command line; all required information must be specified in the cmdline itself
- sql: inline SQL query
- script_location: S3 location containing the query statement
- language: language of the program: Scala, R, or Python
- app_id: ID of a Spark Job Server app
- arguments: spark-submit command line arguments
- user_program_arguments: arguments that the user program takes in
- macros: macro values used in the query
- note_id: ID of the notebook to run
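Two sparkcmd sketches, one for the inline sql form and one for the program/language form; the query, program body, cluster label, and spark-submit arguments are placeholders, and dag comes from the basic sketch above.

    spark_sql_cmd = QuboleOperator(
        task_id='spark_sql_cmd',
        command_type='sparkcmd',
        sql='select count(*) from example_table',  # placeholder inline SQL query
        cluster_label='spark-cluster',             # placeholder cluster label
        dag=dag,
    )

    spark_program_cmd = QuboleOperator(
        task_id='spark_program_cmd',
        command_type='sparkcmd',
        program='print("hello from PySpark")',        # placeholder program body
        language='python',                            # language of the program
        arguments='--conf spark.executor.memory=2g',  # placeholder spark-submit arguments
        dag=dag,
    )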
 
dbtapquerycmd:
- db_tap_id: data store ID of the target database, in Qubole
- query: inline query statement
- macros: macro values used in the query
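A dbtapquerycmd sketch; the db_tap_id is a placeholder data store ID, and dag comes from the basic sketch above.

    db_query = QuboleOperator(
        task_id='db_query',
        command_type='dbtapquerycmd',
        db_tap_id=2064,            # placeholder data store ID in Qubole
        query='show tables',       # inline query statement
        dag=dag,
    )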
 
dbexportcmd:
- mode: 1 for Hive export or 2 for HDFS/S3 export
- schema: DB schema name; assumed by the database if not specified
- hive_table: name of the Hive table
- partition_spec: partition specification for the Hive table
- dbtap_id: data store ID of the target database, in Qubole
- db_table: name of the DB table
- db_update_mode: allowinsert or updateonly
- db_update_keys: columns used to determine the uniqueness of rows
- export_dir: HDFS/S3 location from which data will be exported
- fields_terminated_by: hex of the char used as the column separator in the dataset
- use_customer_cluster: whether to use a customer cluster to run the command
- customer_cluster_label: the label of the cluster to run the command on
- additional_options: additional Sqoop options which are needed; enclose options in double or single quotes, e.g. '--map-column-hive id=int,data=string'
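A dbexportcmd sketch in Hive export mode; the table names, partition spec, and data store ID are placeholders, and dag comes from the basic sketch above.

    db_export = QuboleOperator(
        task_id='db_export',
        command_type='dbexportcmd',
        mode=1,                              # 1 = Hive export
        hive_table='example_hive_table',     # placeholder Hive table
        partition_spec='dt=20110104-02',     # placeholder partition spec
        dbtap_id=2064,                       # placeholder data store ID
        db_table='example_db_table',         # placeholder target DB table
        db_update_mode='allowinsert',
        dag=dag,
    )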
 
dbimportcmd:
- mode: 1 (simple) or 2 (advanced)
- hive_table: name of the Hive table
- schema: DB schema name; assumed by the database if not specified
- hive_serde: output format of the Hive table
- dbtap_id: data store ID of the target database, in Qubole
- db_table: name of the DB table
- where_clause: where clause, if any
- parallelism: number of parallel DB connections to use for extracting data
- extract_query: SQL query to extract data from the DB; $CONDITIONS must be part of the where clause
- boundary_query: query used to get the range of row IDs to be extracted
- split_column: column used as the row ID to split data into ranges (mode 2)
- use_customer_cluster: whether to use a customer cluster to run the command
- customer_cluster_label: the label of the cluster to run the command on
- additional_options: additional Sqoop options which are needed; enclose options in double or single quotes
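A dbimportcmd sketch in simple mode; the table names and data store ID are placeholders, and dag comes from the basic sketch above.

    db_import = QuboleOperator(
        task_id='db_import',
        command_type='dbimportcmd',
        mode=1,                              # 1 = simple mode
        hive_table='example_hive_table',     # placeholder Hive table
        dbtap_id=2064,                       # placeholder data store ID
        db_table='example_db_table',         # placeholder source DB table
        where_clause='id < 10',              # optional where clause
        parallelism=2,                       # parallel DB connections for extraction
        dag=dag,
    )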
 
 
template_fields: Iterable[str] = ['query', 'script_location', 'sub_command', 'script', 'files', 'archives', 'program', 'cmdline', 'sql', 'where_clause', 'tags', 'extract_query', 'boundary_query', 'macros', 'name', 'parameters', 'dbtap_id', 'hive_table', 'db_table', 'split_column', 'note_id', 'db_update_keys', 'export_dir', 'partition_spec', 'qubole_conn_id', 'arguments', 'user_program_arguments', 'cluster_label'][source]¶