airflow.contrib.operators.qubole_operator

Module Contents

class airflow.contrib.operators.qubole_operator.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)[source]

Bases: airflow.models.BaseOperator

Execute tasks (commands) on QDS (https://qubole.com).

Parameters

qubole_conn_id (str) – Connection ID that holds the QDS auth_token

kwargs:
command_type

type of command to be executed, e.g. hivecmd, shellcmd, hadoopcmd

tags

array of tags to be assigned to the command

cluster_label

cluster label on which the command will be executed

name

name to be given to the command

notify

whether to send an email on command completion (default is False)
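
A minimal sketch of wiring the operator into a DAG, using the hivecmd arguments described below; the cluster label, query, tag, and connection values are placeholders:

    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator
    from airflow.utils.dates import days_ago

    dag = DAG(dag_id='qubole_example', start_date=days_ago(1), schedule_interval=None)

    show_tables = QuboleOperator(
        task_id='hive_show_tables',
        command_type='hivecmd',            # type of command to be executed
        query='show tables',               # inline query statement (hivecmd)
        cluster_label='default',           # placeholder cluster label
        tags=['airflow_example_run'],      # tags assigned to the command
        notify=False,
        qubole_conn_id='qubole_default',
        dag=dag,
    )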

Arguments specific to command types

hivecmd:
query

inline query statement

script_location

s3 location containing query statement

sample_size

size of the sample in bytes on which to run the query

macros

macro values to be used in the query
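
Continuing the DAG sketch above, a hivecmd can also point at a script in S3 instead of an inline query; the S3 path and macro values here are hypothetical:

    hive_s3_query = QuboleOperator(
        task_id='hive_s3_script',
        command_type='hivecmd',
        script_location='s3://example-bucket/scripts/show_table.hql',  # hypothetical S3 path
        macros='[{"date": "{{ ds }}"}]',   # macro values used in the query (templated)
        cluster_label='default',           # placeholder cluster label
        dag=dag,                           # 'dag' from the sketch above
    )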

prestocmd:
query

inline query statement

script_location

s3 location containing query statement

macros

macro values to be used in the query
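
A prestocmd takes the same query/script_location/macros arguments; for example (placeholder values, continuing the DAG sketch above):

    presto_query = QuboleOperator(
        task_id='presto_show_tables',
        command_type='prestocmd',
        query='show tables',
        cluster_label='presto-cluster',   # placeholder cluster label
        dag=dag,
    )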

hadoopcmd:
sub_command

must be one of these [“jar”, “s3distcp”, “streaming”] followed by 1 or more args
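
For example, a jar sub_command might look like this (the jar and S3 paths are hypothetical, continuing the DAG sketch above):

    hadoop_jar = QuboleOperator(
        task_id='hadoop_jar_cmd',
        command_type='hadoopcmd',
        sub_command='jar s3://example-bucket/jars/wordcount.jar '
                    's3://example-bucket/input/ s3://example-bucket/output/',
        cluster_label='default',   # placeholder cluster label
        dag=dag,
    )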

shellcmd:
script

inline command with args

script_location

s3 location containing the shell script

files

list of files in an s3 bucket, in file1,file2 format. These files will be copied into the working directory where the Qubole command is being executed.

archives

list of archives in an s3 bucket, in archive1,archive2 format. These will be unarchived into the working directory where the Qubole command is being executed.

parameters

any extra args which need to be passed to the script (only when script_location is supplied)
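
A shellcmd sketch using a script in S3 together with parameters and files (all S3 paths are hypothetical, continuing the DAG sketch above):

    shell_task = QuboleOperator(
        task_id='shell_cmd',
        command_type='shellcmd',
        script_location='s3://example-bucket/scripts/job.sh',   # hypothetical script
        parameters='param1 param2',                             # extra args for the script
        files='s3://example-bucket/files/file1.txt,s3://example-bucket/files/file2.txt',
        cluster_label='default',                                # placeholder cluster label
        dag=dag,
    )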

pigcmd:
script

inline query statement (latin_statements)

script_location

s3 location containing pig query

parameters

any extra args which need to be passed to the script (only when script_location is supplied)
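
A pigcmd sketch, again with a hypothetical S3 script location (continuing the DAG sketch above):

    pig_task = QuboleOperator(
        task_id='pig_cmd',
        command_type='pigcmd',
        script_location='s3://example-bucket/scripts/wordcount.pig',  # hypothetical
        parameters='key1=value1 key2=value2',   # only allowed with script_location
        cluster_label='default',                # placeholder cluster label
        dag=dag,
    )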

sparkcmd:
program

the complete Spark program in Scala, SQL, Command, R, or Python

cmdline

spark-submit command line; all required information must be specified in the cmdline itself.

sql

inline sql query

script_location

s3 location containing query statement

language

language of the program: Scala, SQL, Command, R, or Python

app_id

ID of a Spark job server app

arguments

spark-submit command line arguments

user_program_arguments

arguments that the user program takes in

macros

macro values to be used in the query
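
A sparkcmd sketch running an inline program; the language value and spark-submit arguments shown are assumptions, not an exhaustive reference (continuing the DAG sketch above):

    spark_task = QuboleOperator(
        task_id='spark_cmd',
        command_type='sparkcmd',
        program='print("hello from Spark")',          # complete program text
        language='python',                            # assumed lowercase language name
        arguments='--conf spark.executor.memory=2g',  # spark-submit command line arguments
        cluster_label='spark-cluster',                # placeholder cluster label
        dag=dag,
    )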

dbtapquerycmd:
db_tap_id

data store ID of the target database in Qubole.

query

inline query statement

macros

macro values to be used in the query
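
A dbtapquerycmd sketch; the data store ID is a placeholder (continuing the DAG sketch above):

    db_query = QuboleOperator(
        task_id='db_query',
        command_type='dbtapquerycmd',
        db_tap_id=2064,          # placeholder data store ID in Qubole
        query='show tables',
        dag=dag,
    )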

dbexportcmd:
mode

1 (simple) or 2 (advanced)

hive_table

name of the Hive table

partition_spec

partition specification for Hive table.

dbtap_id

data store ID of the target database in Qubole.

db_table

name of the db table

db_update_mode

allowinsert or updateonly

db_update_keys

columns used to determine the uniqueness of rows

export_dir

HDFS/S3 location from which data will be exported.

fields_terminated_by

hex value of the character used as the column separator in the dataset
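
A dbexportcmd sketch exporting a Hive table in simple mode; the table names, partition spec, and data store ID are placeholders (continuing the DAG sketch above):

    db_export = QuboleOperator(
        task_id='db_export',
        command_type='dbexportcmd',
        mode=1,                              # 1 = simple mode
        hive_table='example_hive_table',     # placeholder Hive table
        db_table='example_db_table',         # placeholder target table
        partition_spec='dt=20200101',        # placeholder partition
        dbtap_id=2064,                       # placeholder data store ID
        dag=dag,
    )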

dbimportcmd:
mode

1 (simple) or 2 (advanced)

hive_table

name of the Hive table

dbtap_id

data store ID of the target database in Qubole.

db_table

name of the db table

where_clause

where clause, if any

parallelism

number of parallel db connections to use for extracting data

extract_query

SQL query to extract data from the db. $CONDITIONS must be part of the WHERE clause.

boundary_query

query used to get the range of row IDs to be extracted

split_column

column used as the row ID to split data into ranges (mode 2)
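
A dbimportcmd sketch importing into a Hive table in simple mode; the names and IDs are placeholders (continuing the DAG sketch above):

    db_import = QuboleOperator(
        task_id='db_import',
        command_type='dbimportcmd',
        mode=1,                              # 1 = simple mode
        hive_table='example_hive_table',     # placeholder Hive table
        db_table='example_db_table',         # placeholder source table
        where_clause='id < 10',              # optional filter
        parallelism=2,                       # parallel db connections
        dbtap_id=2064,                       # placeholder data store ID
        dag=dag,
    )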

template_fields = ['query', 'script_location', 'sub_command', 'script', 'files', 'archives', 'program', 'cmdline', 'sql', 'where_clause', 'tags', 'extract_query', 'boundary_query', 'macros', 'name', 'parameters', 'dbtap_id', 'hive_table', 'db_table', 'split_column', 'note_id', 'db_update_keys', 'export_dir', 'partition_spec', 'qubole_conn_id', 'arguments', 'user_program_arguments', 'cluster_label'][source]
template_ext = ['.txt'][source]
ui_color = '#3064A1'[source]
ui_fgcolor = '#fff'[source]
execute(self, context)[source]
on_kill(self, ti=None)[source]
get_results(self, ti=None, fp=None, inline=True, delim=None, fetch=True)[source]
get_log(self, ti)[source]
get_jobs_id(self, ti)[source]
get_hook(self)[source]
__getattribute__(self, name)[source]
__setattr__(self, name, value)[source]
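
The result-handling helpers can be called on the operator instance from a downstream task. A sketch using a PythonOperator, assuming get_results() writes the command output to a local file and returns its path (as in the shipped example DAG); 'show_tables' and 'dag' are from the sketch at the top of this page:

    from airflow.operators.python_operator import PythonOperator

    def read_results(**kwargs):
        ti = kwargs['ti']
        # Fetch the results of the 'show_tables' task; assumed to return a local file path.
        path = show_tables.get_results(ti)
        with open(path) as f:
            return f.read()

    fetch_results = PythonOperator(
        task_id='fetch_results',
        python_callable=read_results,
        provide_context=True,   # needed on Airflow 1.10 so 'ti' is passed in kwargs
        dag=dag,
    )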