Source code for airflow.providers.apache.spark.operators.spark_submit
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
from __future__ import annotations

from collections.abc import Sequence
from typing import TYPE_CHECKING, Any

from airflow.models import BaseOperator
from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook
from airflow.settings import WEB_COLORS

if TYPE_CHECKING:
    from airflow.utils.context import Context
class SparkSubmitOperator(BaseOperator):
    """
    Wrap the spark-submit binary to kick off a spark-submit job; requires the "spark-submit" binary in the PATH.

    .. seealso::
        For more information on how to use this operator, take a look at the guide:
        :ref:`howto/operator:SparkSubmitOperator`

    :param application: The application submitted as a job, either a jar or py file. (templated)
    :param conf: Arbitrary Spark configuration properties (templated)
    :param conn_id: The :ref:`spark connection id <howto/connection:spark-submit>` as configured
        in Airflow administration. When an invalid connection_id is supplied, it will default to yarn.
    :param files: Upload additional files to the executor running the job, separated by a
        comma. Files will be placed in the working directory of each executor,
        e.g. serialized objects. (templated)
    :param py_files: Additional python files used by the job; can be .zip, .egg or .py. (templated)
    :param jars: Submit additional jars to upload and place them in driver and executor classpaths. (templated)
    :param driver_class_path: Additional, driver-specific, classpath settings. (templated)
    :param java_class: The main class of the Java application
    :param packages: Comma-separated list of maven coordinates of jars to include on the
        driver and executor classpaths. (templated)
    :param exclude_packages: Comma-separated list of maven coordinates of jars to exclude
        while resolving the dependencies provided in 'packages' (templated)
    :param repositories: Comma-separated list of additional remote repositories to search
        for the maven coordinates given with 'packages'
    :param total_executor_cores: (Standalone & Mesos only) Total cores for all executors
        (Default: all the available cores on the worker)
    :param executor_cores: (Standalone & YARN only) Number of cores per executor (Default: 2)
    :param executor_memory: Memory per executor (e.g. 1000M, 2G) (Default: 1G)
    :param driver_memory: Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)
    :param keytab: Full path to the file that contains the keytab (templated)
        (will overwrite any keytab defined in the connection's extra JSON)
    :param principal: The name of the kerberos principal used for keytab (templated)
        (will overwrite any principal defined in the connection's extra JSON)
    :param proxy_user: User to impersonate when submitting the application (templated)
    :param name: Name of the job (default airflow-spark). (templated)
    :param num_executors: Number of executors to launch
    :param status_poll_interval: Seconds to wait between polls of driver status in cluster
        mode (Default: 1)
    :param application_args: Arguments for the application being submitted (templated)
    :param env_vars: Environment variables for spark-submit. It supports yarn and k8s mode too. (templated)
    :param verbose: Whether to pass the verbose flag to the spark-submit process for debugging
    :param spark_binary: The command to use for spark submit. Some distros may use
        spark2-submit or spark3-submit.
        (will overwrite any spark_binary defined in the connection's extra JSON)
    :param properties_file: Path to a file from which to load extra properties. If not
        specified, this will look for conf/spark-defaults.conf.
    :param yarn_queue: The name of the YARN queue to which the application is submitted.
        (will overwrite any yarn queue defined in the connection's extra JSON)
    :param deploy_mode: Whether to deploy your driver on the worker nodes (cluster) or
        locally as a client.
        (will overwrite any deployment mode defined in the connection's extra JSON)
    :param use_krb5ccache: If True, configure spark to use the ticket cache instead of relying
        on a keytab for Kerberos login
    """
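    # NOTE: the rendered source omits ``__init__`` and ``_get_hook``, both of
    # which ``execute`` below depends on. The sketch that follows is an
    # illustrative reconstruction, not the provider's verbatim code: it assumes
    # ``__init__`` stored each constructor argument on ``self`` under the same
    # name and set ``self._hook`` to None, and it forwards only a representative
    # subset of the documented options to SparkSubmitHook.
    def _get_hook(self) -> SparkSubmitHook:
        # Build the hook that actually wraps the spark-submit binary; the real
        # module forwards every documented option, not just those shown here.
        return SparkSubmitHook(
            conf=self.conf,
            conn_id=self.conn_id,
            name=self.name,
            application_args=self.application_args,
            env_vars=self.env_vars,
            verbose=self.verbose,
            spark_binary=self.spark_binary,
        )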
    def execute(self, context: Context) -> None:
        """Call the SparkSubmitHook to run the provided spark job."""
        if self._hook is None:
            self._hook = self._get_hook()
        self._hook.submit(self.application)
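
# Illustrative usage (not part of the module source): a minimal DAG that
# submits a PySpark job with this operator. The connection id, application
# path, and all argument values below are hypothetical placeholders chosen
# for the example, not anything defined by this provider.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_spark_submit",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        application="/opt/spark/jobs/etl.py",  # hypothetical path to the py file to submit
        conn_id="spark_default",
        name="airflow-spark",
        application_args=["--date", "{{ ds }}"],  # application_args is templated
        executor_cores=2,
        executor_memory="2G",
        verbose=True,
    )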