airflow.providers.apache.spark.operators.spark_submit

Module Contents

Classes

SparkSubmitOperator

This operator is a wrapper around the spark-submit binary to kick off a spark-submit job.

class airflow.providers.apache.spark.operators.spark_submit.SparkSubmitOperator(*, application='', conf=None, conn_id='spark_default', files=None, py_files=None, archives=None, driver_class_path=None, jars=None, java_class=None, packages=None, exclude_packages=None, repositories=None, total_executor_cores=None, executor_cores=None, executor_memory=None, driver_memory=None, keytab=None, principal=None, proxy_user=None, name='arrow-spark', num_executors=None, status_poll_interval=1, application_args=None, env_vars=None, verbose=False, spark_binary=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

This operator is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the “spark-submit” binary is on the PATH or that spark-home is set in the extra field of the connection.

See also

For more information on how to use this operator, take a look at the guide: SparkSubmitOperator

Parameters
  • application (str) – The application submitted as a job, either a jar or a py file. (templated)

  • conf (Optional[Dict[str, Any]]) – Arbitrary Spark configuration properties (templated)

  • conn_id (str) – The Spark connection id as configured in Airflow administration. When an invalid connection id is supplied, it defaults to yarn.

  • files (Optional[str]) – Upload additional files to the executor running the job, separated by a comma. Files will be placed in the working directory of each executor. For example, serialized objects. (templated)

  • py_files (Optional[str]) – Additional python files used by the job, can be .zip, .egg or .py. (templated)

  • jars (Optional[str]) – Submit additional jars to upload and place them in executor classpath. (templated)

  • driver_class_path (Optional[str]) – Additional, driver-specific, classpath settings. (templated)

  • java_class (Optional[str]) – The main class of the Java application

  • packages (Optional[str]) – Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. (templated)

  • exclude_packages (Optional[str]) – Comma-separated list of maven coordinates of jars to exclude while resolving the dependencies provided in ‘packages’ (templated)

  • repositories (Optional[str]) – Comma-separated list of additional remote repositories to search for the maven coordinates given with ‘packages’

  • total_executor_cores (Optional[int]) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)

  • executor_cores (Optional[int]) – (Standalone & YARN only) Number of cores per executor (Default: 2)

  • executor_memory (Optional[str]) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)

  • driver_memory (Optional[str]) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)

  • keytab (Optional[str]) – Full path to the file that contains the keytab (templated)

  • principal (Optional[str]) – The name of the kerberos principal used for keytab (templated)

  • proxy_user (Optional[str]) – User to impersonate when submitting the application (templated)

  • name (str) – Name of the job (default arrow-spark, as in the class signature). (templated)

  • num_executors (Optional[int]) – Number of executors to launch

  • status_poll_interval (int) – Seconds to wait between polls of driver status in cluster mode (Default: 1)

  • application_args (Optional[List[Any]]) – Arguments for the application being submitted (templated)

  • env_vars (Optional[Dict[str, Any]]) – Environment variables for spark-submit. It supports yarn and k8s mode too. (templated)

  • verbose (bool) – Whether to pass the verbose flag to spark-submit process for debugging

  • spark_binary (Optional[str]) – The command to use for spark submit. Some distros may use spark2-submit.
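
The parameters above could be combined roughly as follows. This is a minimal sketch, not an official example: the file paths, job name, task id, and configuration values are placeholders, and `make_spark_task` is a hypothetical helper. The import is placed inside the function only so the snippet loads on a machine without the Spark provider installed; in a DAG file it would normally sit at the top of the module.

```python
# Keyword arguments matching the documented parameters.
# Every path, name, and value here is a placeholder chosen for illustration.
SPARK_TASK_KWARGS = {
    "application": "/opt/jobs/etl_job.py",            # hypothetical py file
    "conn_id": "spark_default",                       # default connection id
    "name": "daily-etl",                              # hypothetical job name
    "conf": {"spark.sql.shuffle.partitions": "200"},  # arbitrary Spark properties
    "py_files": "/opt/jobs/deps.zip",                 # hypothetical dependency archive
    "executor_cores": 2,
    "executor_memory": "2G",
    "driver_memory": "1G",
    "num_executors": 4,
    # application_args is templated, so Jinja expressions are rendered at run time.
    "application_args": ["--run-date", "{{ ds }}"],
    "verbose": False,
}


def make_spark_task(task_id="submit_daily_etl"):
    """Hypothetical helper: build the operator from the kwargs above."""
    # Deferred import (see the note above); requires the Apache Spark provider.
    from airflow.providers.apache.spark.operators.spark_submit import (
        SparkSubmitOperator,
    )

    return SparkSubmitOperator(task_id=task_id, **SPARK_TASK_KWARGS)
```

Inside a DAG definition, calling `make_spark_task()` within the `with DAG(...)` block would attach the task to that DAG.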

template_fields :Sequence[str] = ['_application', '_conf', '_files', '_py_files', '_jars', '_driver_class_path', '_packages',...[source]
ui_color[source]
execute(self, context)[source]

Call the SparkSubmitHook to run the provided spark job
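
To give a feel for what "wrapping the spark-submit binary" means, the sketch below maps a handful of the operator's parameters onto spark-submit command-line flags. It is a simplified illustration under stated assumptions, not the hook's actual code: `build_spark_submit_command` is a hypothetical function, the `master` argument stands in for whatever the connection resolves to, and only a subset of the parameters is handled.

```python
from typing import Any, Dict, List, Optional


def build_spark_submit_command(
    application: str,
    master: str = "yarn",  # assumption: stands in for the value resolved from the connection
    conf: Optional[Dict[str, Any]] = None,
    executor_memory: Optional[str] = None,
    num_executors: Optional[int] = None,
    name: str = "arrow-spark",
    application_args: Optional[List[Any]] = None,
    verbose: bool = False,
    spark_binary: str = "spark-submit",
) -> List[str]:
    """Hypothetical helper: assemble an argument list for spark-submit."""
    cmd = [spark_binary, "--master", master]
    # Each configuration property becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if num_executors:
        cmd += ["--num-executors", str(num_executors)]
    cmd += ["--name", name]
    if verbose:
        cmd.append("--verbose")
    # The application comes after all options, followed by its own arguments.
    cmd.append(application)
    cmd += [str(arg) for arg in (application_args or [])]
    return cmd


if __name__ == "__main__":
    print(build_spark_submit_command("app.py", executor_memory="2G"))
```

Returning a list rather than a single string keeps each argument separate, which avoids shell-quoting problems if the list is later passed to `subprocess`.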

on_kill(self)[source]

Override this method to clean up subprocesses when a task instance gets killed. Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.
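
The cleanup pattern described for on_kill might look like the sketch below. `SubprocessTask` is a hypothetical stand-in written with the standard library only; it does not subclass BaseOperator and is not how SparkSubmitOperator itself is implemented. The child process is a placeholder that just sleeps.

```python
import subprocess
import sys
from typing import Optional


class SubprocessTask:
    """Hypothetical stand-in for an operator that launches a child process."""

    def __init__(self) -> None:
        self._process: Optional[subprocess.Popen] = None

    def execute(self) -> None:
        # Placeholder child process that sleeps; a real operator would
        # launch its actual job here and keep the handle for later cleanup.
        self._process = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )

    def on_kill(self) -> None:
        # Nothing to do if the child never started or has already exited.
        if self._process is None or self._process.poll() is not None:
            return
        # Ask the child to stop, then force-kill it if it does not exit in time.
        self._process.terminate()
        try:
            self._process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            self._process.kill()
            self._process.wait()
```

Waiting after `terminate()` (and after `kill()`) reaps the child, so no defunct process is left behind once the task is gone.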
