airflow.providers.amazon.aws.operators.emr

Module Contents

Classes

EmrAddStepsOperator

An operator that adds steps to an existing EMR job_flow.

EmrStartNotebookExecutionOperator

An operator that starts an EMR notebook execution.

EmrStopNotebookExecutionOperator

An operator that stops a running EMR notebook execution.

EmrEksCreateClusterOperator

An operator that creates EMR on EKS virtual clusters.

EmrContainerOperator

An operator that submits jobs to EMR on EKS virtual clusters.

EmrCreateJobFlowOperator

Creates an EMR JobFlow, reading the config from the EMR connection.

EmrModifyClusterOperator

An operator that modifies an existing EMR cluster.

EmrTerminateJobFlowOperator

Operator to terminate EMR JobFlows.

EmrServerlessCreateApplicationOperator

Operator to create Serverless EMR Application.

EmrServerlessStartJobOperator

Operator to start EMR Serverless job.

EmrServerlessStopApplicationOperator

Operator to stop an EMR Serverless application.

EmrServerlessDeleteApplicationOperator

Operator to delete EMR Serverless application.

class airflow.providers.amazon.aws.operators.emr.EmrAddStepsOperator(*, job_flow_id=None, job_flow_name=None, cluster_states=None, aws_conn_id='aws_default', steps=None, wait_for_completion=False, waiter_delay=30, waiter_max_attempts=60, execution_role_arn=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that adds steps to an existing EMR job_flow.

See also

For more information on how to use this operator, take a look at the guide: Add Steps to an EMR job flow

Parameters
  • job_flow_id (str | None) – id of the JobFlow to add steps to. (templated)

  • job_flow_name (str | None) – name of the JobFlow to add steps to. Use as an alternative to passing job_flow_id. The operator will search for the id of a JobFlow with a matching name in one of the states given in cluster_states; exactly one such cluster must exist, otherwise the task fails. (templated)

  • cluster_states (list[str] | None) – Acceptable cluster states when searching for JobFlow id by job_flow_name. (templated)

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • steps (list[dict] | str | None) – boto3 style steps or reference to a steps file (must be ‘.json’) to be added to the jobflow. (templated)

  • wait_for_completion (bool) – If True, the operator will wait for all the steps to be completed. Defaults to False.

  • waiter_delay (int) – Time (in seconds) to wait between polling attempts for step status. Defaults to 30.

  • waiter_max_attempts (int) – Maximum number of polling attempts for step status before failing. Defaults to 60.

  • execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.

  • do_xcom_push – if True, job_flow_id is pushed to XCom with key job_flow_id.

  • deferrable (bool) – If True, the operator will wait asynchronously for the job to complete. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)

template_fields: Sequence[str] = ('job_flow_id', 'job_flow_name', 'cluster_states', 'steps', 'execution_role_arn')[source]
template_ext: Sequence[str] = ('.json',)[source]
template_fields_renderers[source]
ui_color = '#f9c915'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]
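As an illustration, the boto3-style steps payload this operator accepts can be sketched as follows (the step name and arguments are hypothetical, not from the source):

```python
# Illustrative boto3-style step definitions for EmrAddStepsOperator(steps=...).
# Each entry follows the Steps[] structure of the EMR AddJobFlowSteps API;
# the step name and arguments below are hypothetical.
SPARK_STEPS = [
    {
        "Name": "calculate_pi",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-example", "SparkPi", "10"],
        },
    }
]
# Since template_ext includes '.json', the same payload could instead live in a
# templated file: EmrAddStepsOperator(steps="steps.json", ...)
```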
class airflow.providers.amazon.aws.operators.emr.EmrStartNotebookExecutionOperator(editor_id, relative_path, cluster_id, service_role, notebook_execution_name=None, notebook_params=None, notebook_instance_security_group_id=None, master_instance_security_group_id=None, tags=None, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=None, waiter_delay=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that starts an EMR notebook execution.

See also

For more information on how to use this operator, take a look at the guide: Start an EMR notebook execution

Parameters
  • editor_id (str) – The unique identifier of the EMR notebook to use for notebook execution.

  • relative_path (str) – The path and file name of the notebook file for this execution, relative to the path specified for the EMR notebook.

  • cluster_id (str) – The unique identifier of the EMR cluster the notebook is attached to.

  • service_role (str) – The name or ARN of the IAM role that is used as the service role for Amazon EMR (the EMR role) for the notebook execution.

  • notebook_execution_name (str | None) – Optional name for the notebook execution.

  • notebook_params (str | None) – Input parameters in JSON format passed to the EMR notebook at runtime for execution.

  • notebook_instance_security_group_id (str | None) – The unique identifier of the Amazon EC2 security group to associate with the EMR notebook for this notebook execution.

  • master_instance_security_group_id (str | None) – Optional unique ID of an EC2 security group to associate with the master instance of the EMR cluster for this notebook execution.

  • tags (list | None) – Optional list of key/value pairs to associate with the notebook execution.

  • waiter_max_attempts (int | None) – Maximum number of tries before failing.

  • waiter_delay (int | None) – Number of seconds between polling the state of the notebook.

template_fields: Sequence[str] = ('editor_id', 'cluster_id', 'relative_path', 'service_role', 'notebook_execution_name',...[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
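Since notebook_params is passed to the EMR API as a JSON-formatted string rather than a dict, a small sketch of preparing it (keys and values are hypothetical):

```python
import json

# notebook_params must be a JSON-formatted string, so a Python dict is
# serialized first. The parameter names and values here are hypothetical.
params = {"input_path": "s3://my-bucket/data/", "sample_rate": 0.1}
notebook_params = json.dumps(params)
# EmrStartNotebookExecutionOperator(..., notebook_params=notebook_params)
```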

class airflow.providers.amazon.aws.operators.emr.EmrStopNotebookExecutionOperator(notebook_execution_id, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=None, waiter_delay=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that stops a running EMR notebook execution.

See also

For more information on how to use this operator, take a look at the guide: Stop an EMR notebook execution

Parameters
  • notebook_execution_id (str) – The unique identifier of the notebook execution.

  • wait_for_completion (bool) – If True, the operator will wait for the notebook execution to be in a STOPPED or FINISHED state. Defaults to False.

  • aws_conn_id (str | None) – aws connection to use. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • waiter_max_attempts (int | None) – Maximum number of tries before failing.

  • waiter_delay (int | None) – Number of seconds between polling the state of the notebook.

template_fields: Sequence[str] = ('notebook_execution_id', 'waiter_delay', 'waiter_max_attempts')[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrEksCreateClusterOperator(*, virtual_cluster_name, eks_cluster_name, eks_namespace, virtual_cluster_id='', aws_conn_id='aws_default', tags=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that creates EMR on EKS virtual clusters.

See also

For more information on how to use this operator, take a look at the guide: Create an Amazon EMR EKS virtual cluster

Parameters
  • virtual_cluster_name (str) – The name of the EMR EKS virtual cluster to create.

  • eks_cluster_name (str) – The EKS cluster used by the EMR virtual cluster.

  • eks_namespace (str) – namespace used by the EKS cluster.

  • virtual_cluster_id (str) – The EMR on EKS virtual cluster id.

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials.

  • tags (dict | None) – The tags assigned to created cluster. Defaults to None

template_fields: Sequence[str] = ('virtual_cluster_name', 'eks_cluster_name', 'eks_namespace')[source]
ui_color = '#f9c915'[source]
hook()[source]

Create and return an EmrContainerHook.

execute(context)[source]

Create an EMR on EKS virtual cluster.

class airflow.providers.amazon.aws.operators.emr.EmrContainerOperator(*, name, virtual_cluster_id, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, aws_conn_id='aws_default', wait_for_completion=True, poll_interval=30, tags=None, max_polling_attempts=None, job_retry_max_attempts=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that submits jobs to EMR on EKS virtual clusters.

See also

For more information on how to use this operator, take a look at the guide: Submit a job to an Amazon EMR virtual cluster

Parameters
  • name (str) – The name of the job run.

  • virtual_cluster_id (str) – The EMR on EKS virtual cluster ID

  • execution_role_arn (str) – The IAM role ARN associated with the job run.

  • release_label (str) – The Amazon EMR release version to use for the job run.

  • job_driver (dict) – Job configuration details, e.g. the Spark job parameters.

  • configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.

  • client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started. If no token is provided, a UUIDv4 token will be generated for you.

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials.

  • wait_for_completion (bool) – Whether or not to wait in the operator for the job to complete.

  • poll_interval (int) – Time (in seconds) to wait between two consecutive calls to check query status on EMR

  • max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.

  • job_retry_max_attempts (int | None) – Maximum number of times to retry when the EMR job fails. Defaults to None, which disables retries.

  • tags (dict | None) – The tags assigned to job runs. Defaults to None

  • deferrable (bool) – Run operator in the deferrable mode.

template_fields: Sequence[str] = ('name', 'virtual_cluster_id', 'execution_role_arn', 'release_label', 'job_driver',...[source]
ui_color = '#f9c915'[source]
hook()[source]

Create and return an EmrContainerHook.

execute(context)[source]

Run job on EMR Containers.

check_failure(query_status)[source]
execute_complete(context, event=None)[source]
on_kill()[source]

Cancel the submitted job run.
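A sketch of the job_driver and configuration_overrides dictionaries this operator forwards to the EMR-on-EKS StartJobRun API (the S3 path and log group name are hypothetical):

```python
# Illustrative job_driver and configuration_overrides payloads for
# EmrContainerOperator, mirroring the EMR-on-EKS StartJobRun API.
# The S3 path and log group name are hypothetical.
JOB_DRIVER = {
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://my-bucket/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2",
    }
}
CONFIGURATION_OVERRIDES = {
    "monitoringConfiguration": {
        "cloudWatchMonitoringConfiguration": {
            "logGroupName": "/emr-containers/jobs",
        }
    }
}
```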

class airflow.providers.amazon.aws.operators.emr.EmrCreateJobFlowOperator(*, aws_conn_id='aws_default', emr_conn_id='emr_default', job_flow_overrides=None, region_name=None, wait_for_completion=False, waiter_max_attempts=None, waiter_delay=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates an EMR JobFlow, reading the config from the EMR connection.

A dictionary of JobFlow overrides can be passed that override the config from the connection.

See also

For more information on how to use this operator, take a look at the guide: Create an EMR job flow

Parameters
  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node)

  • emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. Use to receive an initial Amazon EMR cluster configuration: boto3.client('emr').run_job_flow request body. If this is None or empty or the connection does not exist, then an empty initial configuration is used.

  • job_flow_overrides (str | dict[str, Any] | None) – boto3 style arguments or reference to an arguments file (must be ‘.json’) to override specific emr_conn_id extra parameters. (templated)

  • region_name (str | None) – Region name passed to EmrHook.

  • wait_for_completion (bool) – Whether to finish task immediately after creation (False) or wait for jobflow completion (True)

  • waiter_max_attempts (int | None) – Maximum number of tries before failing.

  • waiter_delay (int | None) – Number of seconds between polling the state of the job flow.

  • deferrable (bool) – If True, the operator will wait asynchronously for the job flow to complete. This implies waiting for completion. This mode requires the aiobotocore module to be installed. (default: False)

template_fields: Sequence[str] = ('job_flow_overrides', 'waiter_delay', 'waiter_max_attempts')[source]
template_ext: Sequence[str] = ('.json',)[source]
template_fields_renderers[source]
ui_color = '#f9c915'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]
on_kill()[source]

Terminate the EMR cluster (job flow) unless TerminationProtected is enabled on the cluster.
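A sketch of a job_flow_overrides dictionary, following the boto3 run_job_flow request body that this operator reads from the EMR connection and overrides (all values are hypothetical):

```python
# Illustrative job_flow_overrides for EmrCreateJobFlowOperator. Keys follow the
# boto3 run_job_flow request body; the names, release label, and instance
# settings below are hypothetical.
JOB_FLOW_OVERRIDES = {
    "Name": "my-emr-cluster",
    "ReleaseLabel": "emr-7.1.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
# EmrCreateJobFlowOperator(job_flow_overrides=JOB_FLOW_OVERRIDES, ...) or,
# since template_ext includes '.json', job_flow_overrides="job_flow.json".
```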

class airflow.providers.amazon.aws.operators.emr.EmrModifyClusterOperator(*, cluster_id, step_concurrency_level, aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that modifies an existing EMR cluster.

See also

For more information on how to use this operator, take a look at the guide: Modify Amazon EMR container

Parameters
  • cluster_id (str) – cluster identifier

  • step_concurrency_level (int) – Concurrency of the cluster

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used.

  • do_xcom_push – if True, cluster_id is pushed to XCom with key cluster_id.

template_fields: Sequence[str] = ('cluster_id', 'step_concurrency_level')[source]
template_ext: Sequence[str] = ()[source]
ui_color = '#f9c915'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrTerminateJobFlowOperator(*, job_flow_id, aws_conn_id='aws_default', waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to terminate EMR JobFlows.

See also

For more information on how to use this operator, take a look at the guide: Terminate an EMR job flow

Parameters
  • job_flow_id (str) – id of the JobFlow to terminate. (templated)

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • waiter_delay (int) – Time (in seconds) to wait between two consecutive calls to check JobFlow status

  • waiter_max_attempts (int) – The maximum number of times to poll for JobFlow status.

  • deferrable (bool) – If True, the operator will wait asynchronously for the job flow to terminate. This implies waiting for completion. This mode requires the aiobotocore module to be installed. (default: False)

template_fields: Sequence[str] = ('job_flow_id',)[source]
template_ext: Sequence[str] = ()[source]
ui_color = '#f9c915'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]
class airflow.providers.amazon.aws.operators.emr.EmrServerlessCreateApplicationOperator(release_label, job_type, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to create Serverless EMR Application.

See also

For more information on how to use this operator, take a look at the guide: Create an EMR Serverless Application

Parameters
  • release_label (str) – The EMR release version associated with the application.

  • job_type (str) – The type of application you want to start, such as Spark or Hive.

  • wait_for_completion (bool) – If true, wait for the Application to start before returning. Defaults to True. If set to False, waiter_max_attempts and waiter_delay will only be applied when waiting for the application to be in the CREATED state.

  • client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.

  • config (dict | None) – Optional dictionary for arbitrary parameters to the boto API create_application call.

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the application.

  • deferrable (bool) – If True, the operator will wait asynchronously for application to be created. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)

  • waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. If not set, the waiter will use its default value.

hook()[source]

Create and return an EmrServerlessHook.

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

start_application_deferred(context, event=None)[source]
execute_complete(context, event=None)[source]
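A sketch of the optional config dictionary forwarded to the boto3 create_application call (the application name and capacity values are hypothetical):

```python
# Illustrative `config` dict for EmrServerlessCreateApplicationOperator,
# forwarded to the boto3 emr-serverless create_application call.
# The name and initial capacity values below are hypothetical.
APPLICATION_CONFIG = {
    "name": "my-serverless-app",
    "initialCapacity": {
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"},
        }
    },
}
# EmrServerlessCreateApplicationOperator(release_label="emr-7.1.0",
#     job_type="SPARK", config=APPLICATION_CONFIG, ...)
```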
class airflow.providers.amazon.aws.operators.emr.EmrServerlessStartJobOperator(application_id, execution_role_arn, job_driver, configuration_overrides=None, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', name=None, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), enable_application_ui_links=False, **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to start EMR Serverless job.

See also

For more information on how to use this operator, take a look at the guide: Start an EMR Serverless Job

Parameters
  • application_id (str) – ID of the EMR Serverless application to start.

  • execution_role_arn (str) – ARN of role to perform action.

  • job_driver (dict) – Driver that the job runs on.

  • configuration_overrides (dict | None) – Configuration specifications to override existing configurations.

  • client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.

  • config (dict | None) – Optional dictionary for arbitrary parameters to the boto API start_job_run call.

  • wait_for_completion (bool) – If true, waits for the job to start before returning. Defaults to True. If set to False, waiter_max_attempts and waiter_delay will only be applied when waiting for the application to be in the STARTED state.

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • name (str | None) – Name for the EMR Serverless job. If not provided, a default name will be assigned.

  • waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the job run.

  • deferrable (bool) – If True, the operator will wait asynchronously for the job run to complete. This implies waiting for completion. This mode requires the aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)

  • enable_application_ui_links (bool) – If True, the operator will generate one-time links to EMR Serverless application UIs. The generated links will allow any user with access to the DAG to see the Spark or Tez UI or Spark stdout logs. Defaults to False.

  • waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. If not set, the waiter will use its default value.

template_fields: Sequence[str] = ('application_id', 'config', 'execution_role_arn', 'job_driver', 'configuration_overrides',...[source]
template_fields_renderers[source]
hook()[source]

Create and return an EmrServerlessHook.

execute(context, event=None)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]
on_kill()[source]

Cancel the submitted job run.

Note: this method will not run in deferrable mode.

is_monitoring_in_job_override(config_key, job_override)[source]

Check if monitoring is enabled for the job.

Note: This is not compatible with application defaults: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/default-configs.html

This is used to determine what extra links should be shown.

persist_links(context)[source]

Populate the relevant extra links for the EMR Serverless jobs.
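A sketch of job_driver and configuration_overrides payloads for an EMR Serverless Spark job, following the StartJobRun API (the S3 paths are hypothetical):

```python
# Illustrative job_driver and configuration_overrides payloads for
# EmrServerlessStartJobOperator, mirroring the EMR Serverless StartJobRun API.
# The S3 paths below are hypothetical.
JOB_DRIVER = {
    "sparkSubmit": {
        "entryPoint": "s3://my-bucket/scripts/wordcount.py",
        "entryPointArguments": ["s3://my-bucket/output/"],
    }
}
CONFIGURATION_OVERRIDES = {
    "monitoringConfiguration": {
        "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/logs/"}
    }
}
# EmrServerlessStartJobOperator(application_id=..., execution_role_arn=...,
#     job_driver=JOB_DRIVER, configuration_overrides=CONFIGURATION_OVERRIDES)
```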

class airflow.providers.amazon.aws.operators.emr.EmrServerlessStopApplicationOperator(application_id, wait_for_completion=True, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, force_stop=False, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to stop an EMR Serverless application.

See also

For more information on how to use this operator, take a look at the guide: Stop an EMR Serverless Application

Parameters
  • application_id (str) – ID of the EMR Serverless application to stop.

  • wait_for_completion (bool) – If true, wait for the Application to stop before returning. Default to True

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • force_stop (bool) – If set to True, any job for that app that is not in a terminal state will be cancelled. Otherwise, trying to stop an app with running jobs will return an error. If you want to wait for the jobs to finish gracefully, use airflow.providers.amazon.aws.sensors.emr.EmrServerlessJobSensor

  • waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the application. Default is 60 seconds.

  • deferrable (bool) – If True, the operator will wait asynchronously for the application to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)

  • waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. Default is 25.

template_fields: Sequence[str] = ('application_id',)[source]
hook()[source]

Create and return an EmrServerlessHook.

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

stop_application(context, event=None)[source]
execute_complete(context, event=None)[source]
class airflow.providers.amazon.aws.operators.emr.EmrServerlessDeleteApplicationOperator(application_id, wait_for_completion=True, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, force_stop=False, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: EmrServerlessStopApplicationOperator

Operator to delete EMR Serverless application.

See also

For more information on how to use this operator, take a look at the guide: Delete an EMR Serverless Application

Parameters
  • application_id (str) – ID of the EMR Serverless application to delete.

  • wait_for_completion (bool) – If true, wait for the Application to be deleted before returning. Defaults to True. Note that this operator will always wait for the application to be STOPPED first.

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the application. Defaults to 60 seconds.

  • deferrable (bool) – If True, the operator will wait asynchronously for application to be deleted. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)

  • force_stop (bool) – If set to True, any job for that app that is not in a terminal state will be cancelled. Otherwise, trying to delete an app with running jobs will return an error. If you want to wait for the jobs to finish gracefully, use airflow.providers.amazon.aws.sensors.emr.EmrServerlessJobSensor

  • waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. Defaults to 25.

template_fields: Sequence[str] = ('application_id',)[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]
