airflow.contrib.operators.qubole_operator

Qubole operator

Module Contents

class airflow.contrib.operators.qubole_operator.QDSLink[source]

Bases: airflow.models.baseoperator.BaseOperatorLink

Link to QDS.

name = Go to QDS[source]

get_link(self, operator, dttm)[source]

Get link to the Qubole command result page.

Parameters
  • operator – operator

  • dttm – datetime

Returns

url link

class airflow.contrib.operators.qubole_operator.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)[source]

Bases: airflow.models.baseoperator.BaseOperator

Execute tasks (commands) on QDS (https://qubole.com).

Parameters

qubole_conn_id (str) – Connection id which consists of qds auth_token

kwargs:

command_type: type of command to be executed, e.g. hivecmd, shellcmd, hadoopcmd
tags: array of tags to be assigned to the command
cluster_label: cluster label on which the command will be executed
name: name to be given to the command
notify: whether to send an email on command completion or not (default is False)

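A minimal usage sketch, assuming a Qubole connection named qubole_default that carries a valid auth_token and a cluster labelled default; the DAG id, task_id, query, and tag values below are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator

    with DAG(dag_id='example_qubole',
             start_date=datetime(2021, 1, 1),
             schedule_interval=None) as dag:

        # Run a simple Hive command on the cluster labelled 'default'
        show_tables = QuboleOperator(
            task_id='hive_show_tables',
            command_type='hivecmd',
            query='show tables',
            cluster_label='default',
            tags=['example'],
            notify=False,
            qubole_conn_id='qubole_default',
        )

The sketches under the individual command types below reuse this dag object via the dag argument.
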
Arguments specific to command types

hivecmd:

query: inline query statement
script_location: s3 location containing the query statement
sample_size: size of sample in bytes on which to run the query
macros: macro values which were used in the query
hive-version: specifies the Hive version to be used, e.g. 0.13, 1.2, etc.

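A sketch of a hivecmd task that reads its statement from S3; the S3 path is a placeholder, and the macros value is assumed here to be a JSON list of name/value pairs:

    # hivecmd with the query stored in S3; 'report_date' is expanded from macros
    hive_from_s3 = QuboleOperator(
        task_id='hive_from_s3',
        command_type='hivecmd',
        script_location='s3://my-bucket/scripts/daily_report.hql',  # placeholder path
        macros='[{"report_date": "{{ ds }}"}]',  # assumed JSON format for macro values
        sample_size=1024,                        # run the query on a 1 KB sample
        cluster_label='default',
        dag=dag,
    )
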
prestocmd:

query: inline query statement
script_location: s3 location containing the query statement
macros: macro values which were used in the query

hadoopcmd:

sub_command: must be one of ["jar", "s3distcp", "streaming"], followed by 1 or more args

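A sketch of a hadoopcmd task; sub_command is a single string that starts with jar, s3distcp, or streaming and is followed by its arguments (the jar path, input, and output below are placeholders):

    hadoop_jar = QuboleOperator(
        task_id='hadoop_jar_cmd',
        command_type='hadoopcmd',
        # "jar" followed by the jar location and that job's own arguments
        sub_command='jar s3://my-bucket/jars/wordcount.jar '
                    '-input s3://my-bucket/input -output s3://my-bucket/output',
        cluster_label='default',
        dag=dag,
    )
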
shellcmd:

script: inline command with args
script_location: s3 location containing the query statement
files: list of files in an s3 bucket, in file1,file2 format. These files will be copied into the working directory where the qubole command is being executed.
archives: list of archives in an s3 bucket, in archive1,archive2 format. These will be unarchived into the working directory where the qubole command is being executed.
parameters: any extra args which need to be passed to the script (only when script_location is supplied)

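A sketch of a shellcmd task driven by a script in S3, with supporting files copied into the working directory; all S3 paths and parameter values are placeholders:

    shell_task = QuboleOperator(
        task_id='shell_cmd',
        command_type='shellcmd',
        script_location='s3://my-bucket/scripts/run_job.sh',  # placeholder
        # copied into the command's working directory before execution
        files='s3://my-bucket/conf/settings.conf,s3://my-bucket/conf/lookup.csv',
        # extra args; allowed only because script_location is supplied
        parameters='2021-01-01 full',
        cluster_label='default',
        dag=dag,
    )
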
pigcmd:

script: inline query statement (latin_statements)
script_location: s3 location containing the pig query
parameters: any extra args which need to be passed to the script (only when script_location is supplied)

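A sketch of a pigcmd task with inline Pig Latin; the statements and S3 path are placeholders:

    pig_task = QuboleOperator(
        task_id='pig_cmd',
        command_type='pigcmd',
        # inline latin_statements; alternatively pass script_location plus parameters
        script="A = LOAD 's3://my-bucket/data' USING PigStorage(','); DUMP A;",
        cluster_label='default',
        dag=dag,
    )
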
sparkcmd:

program: the complete Spark program in Scala, R, or Python
cmdline: spark-submit command line; all required information must be specified in the cmdline itself.
sql: inline sql query
script_location: s3 location containing the query statement
language: language of the program (Scala, R, or Python)
app_id: ID of a Spark job server app
arguments: spark-submit command line arguments
user_program_arguments: arguments that the user program takes in
macros: macro values which were used in the query
note_id: ID of the notebook to run

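A sketch of a sparkcmd task submitting an inline PySpark program; the program body, spark-submit arguments, and user program arguments are placeholders:

    spark_program = '''
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("airflow_qubole_sketch").getOrCreate()
    print(spark.range(100).count())
    spark.stop()
    '''

    spark_task = QuboleOperator(
        task_id='spark_cmd',
        command_type='sparkcmd',
        program=spark_program,                        # the complete program, here in Python
        language='python',
        arguments='--conf spark.executor.memory=2g',  # spark-submit command line arguments
        user_program_arguments='arg1 arg2',           # passed to the program itself
        cluster_label='default',
        dag=dag,
    )
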
dbtapquerycmd:

db_tap_id: data store ID of the target database, in Qubole
query: inline query statement
macros: macro values which were used in the query

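A sketch of a dbtapquerycmd task; the db_tap_id is a placeholder for a data store ID registered in Qubole:

    dbtap_query = QuboleOperator(
        task_id='db_query',
        command_type='dbtapquerycmd',
        db_tap_id=2064,          # placeholder data store ID
        query='show tables',
        dag=dag,
    )
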
dbexportcmd:

mode: can be 1 for Hive export or 2 for HDFS/S3 export
schema: db schema name, assumed accordingly by the database if not specified
hive_table: name of the hive table
partition_spec: partition specification for the Hive table
dbtap_id: data store ID of the target database, in Qubole
db_table: name of the db table
db_update_mode: allowinsert or updateonly
db_update_keys: columns used to determine the uniqueness of rows
export_dir: HDFS/S3 location from which data will be exported
fields_terminated_by: hex of the char used as column separator in the dataset
use_customer_cluster: whether to use the customer cluster to run the command
customer_cluster_label: the label of the cluster to run the command on
additional_options: additional Sqoop options which are needed; enclose the options in double or single quotes, e.g. '--map-column-hive id=int,data=string'

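A sketch of a dbexportcmd task exporting a Hive partition into a relational table; the table names, partition spec, and data store ID are placeholders:

    db_export = QuboleOperator(
        task_id='db_export',
        command_type='dbexportcmd',
        mode=1,                               # 1 = Hive export
        hive_table='default.daily_report',    # placeholder Hive table
        partition_spec='dt=20210101',
        dbtap_id=2064,                        # placeholder data store ID
        db_table='reports_export',
        db_update_mode='allowinsert',
        db_update_keys='id',
        dag=dag,
    )
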
dbimportcmd:

mode: 1 (simple) or 2 (advanced)
hive_table: name of the hive table
schema: db schema name, assumed accordingly by the database if not specified
hive_serde: output format of the Hive table
dbtap_id: data store ID of the target database, in Qubole
db_table: name of the db table
where_clause: where clause, if any
parallelism: number of parallel db connections to use for extracting data
extract_query: SQL query to extract data from the db. $CONDITIONS must be part of the where clause.
boundary_query: query used to get the range of row IDs to be extracted
split_column: column used as the row ID to split data into ranges (mode 2)
use_customer_cluster: whether to use the customer cluster to run the command
customer_cluster_label: the label of the cluster to run the command on
additional_options: additional Sqoop options which are needed; enclose the options in double or single quotes

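A sketch of a dbimportcmd task pulling a relational table into Hive in simple mode; the table names, where clause, and data store ID are placeholders:

    db_import = QuboleOperator(
        task_id='db_import',
        command_type='dbimportcmd',
        mode=1,                            # 1 = simple mode
        dbtap_id=2064,                     # placeholder data store ID
        db_table='orders',
        where_clause="order_date >= '2021-01-01'",
        hive_table='imported_orders',
        parallelism=2,
        dag=dag,
    )
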
template_fields :Iterable[str] = ['query', 'script_location', 'sub_command', 'script', 'files', 'archives', 'program', 'cmdline', 'sql', 'where_clause', 'tags', 'extract_query', 'boundary_query', 'macros', 'name', 'parameters', 'dbtap_id', 'hive_table', 'db_table', 'split_column', 'note_id', 'db_update_keys', 'export_dir', 'partition_spec', 'qubole_conn_id', 'arguments', 'user_program_arguments', 'cluster_label'][source]
template_ext :Iterable[str] = ['.txt'][source]
ui_color = #3064A1[source]
ui_fgcolor = #fff[source]
qubole_hook_allowed_args_list = ['command_type', 'qubole_conn_id', 'fetch_logs'][source]
__serialized_fields :Optional[FrozenSet[str]][source]
_get_filtered_args(self, all_kwargs)[source]
execute(self, context)[source]
on_kill(self, ti=None)[source]
get_results(self, ti=None, fp=None, inline=True, delim=None, fetch=True)[source]

get_results from Qubole

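A sketch of collecting command results in a downstream task, in the spirit of the Airflow example DAG; hive_task_1 and hive_task_2 stand for QuboleOperator tasks defined elsewhere, and get_results is assumed to return the path of the local file the results were written to:

    import filecmp

    from airflow.operators.python_operator import PythonOperator

    def compare_results(**kwargs):
        ti = kwargs['ti']
        path_1 = hive_task_1.get_results(ti)  # hypothetical upstream QuboleOperator
        path_2 = hive_task_2.get_results(ti)  # hypothetical upstream QuboleOperator
        return filecmp.cmp(path_1, path_2)

    compare = PythonOperator(
        task_id='compare_results',
        python_callable=compare_results,
        provide_context=True,
        dag=dag,
    )
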
get_log(self, ti)[source]

get_log from Qubole

get_jobs_id(self, ti)[source]

get jobs_id from Qubole

get_hook(self)[source]

Reinitialise the hook, as some template fields might have changed.

__getattribute__(self, name)[source]
__setattr__(self, name, value)[source]
classmethod get_serialized_fields(cls)[source]

A serialized QuboleOperator contains exactly these fields.
