airflow.providers.amazon.aws.operators.emr

Module Contents

Classes

EmrAddStepsOperator

An operator that adds steps to an existing EMR job_flow.

EmrStartNotebookExecutionOperator

An operator that starts an EMR notebook execution.

EmrStopNotebookExecutionOperator

An operator that stops a running EMR notebook execution.

EmrEksCreateClusterOperator

An operator that creates EMR on EKS virtual clusters.

EmrContainerOperator

An operator that submits jobs to EMR on EKS virtual clusters.

EmrCreateJobFlowOperator

Creates an EMR JobFlow, reading the config from the EMR connection.

EmrModifyClusterOperator

An operator that modifies an existing EMR cluster.

EmrTerminateJobFlowOperator

Operator to terminate EMR JobFlows.

EmrServerlessCreateApplicationOperator

Operator to create Serverless EMR Application

EmrServerlessStartJobOperator

Operator to start EMR Serverless job.

EmrServerlessDeleteApplicationOperator

Operator to delete EMR Serverless application

class airflow.providers.amazon.aws.operators.emr.EmrAddStepsOperator(*, job_flow_id=None, job_flow_name=None, cluster_states=None, aws_conn_id='aws_default', steps=None, wait_for_completion=False, waiter_delay=None, waiter_max_attempts=None, execution_role_arn=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that adds steps to an existing EMR job_flow.

See also

For more information on how to use this operator, take a look at the guide: Add Steps to an EMR job flow

Parameters
  • job_flow_id (str | None) – id of the JobFlow to add steps to. (templated)

  • job_flow_name (str | None) – name of the JobFlow to add steps to. Use as an alternative to passing job_flow_id. will search for id of JobFlow with matching name in one of the states in param cluster_states. Exactly one cluster like this should exist or will fail. (templated)

  • cluster_states (list[str] | None) – Acceptable cluster states when searching for JobFlow id by job_flow_name. (templated)

  • aws_conn_id (str) – aws connection to uses

  • steps (list[dict] | str | None) – boto3 style steps or reference to a steps file (must be ‘.json’) to be added to the jobflow. (templated)

  • wait_for_completion (bool) – If True, the operator will wait for all the steps to be completed.

  • execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.

  • do_xcom_push – if True, job_flow_id is pushed to XCom with key job_flow_id.

template_fields: Sequence[str] = ('job_flow_id', 'job_flow_name', 'cluster_states', 'steps', 'execution_role_arn')[source]
template_ext: Sequence[str] = ('.json',)[source]
template_fields_renderers[source]
ui_color = '#f9c915'[source]
execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrStartNotebookExecutionOperator(editor_id, relative_path, cluster_id, service_role, notebook_execution_name=None, notebook_params=None, notebook_instance_security_group_id=None, master_instance_security_group_id=None, tags=None, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that starts an EMR notebook execution.

See also

For more information on how to use this operator, take a look at the guide: Start an EMR notebook execution

Parameters
  • editor_id (str) – The unique identifier of the EMR notebook to use for notebook execution.

  • relative_path (str) – The path and file name of the notebook file for this execution, relative to the path specified for the EMR notebook.

  • cluster_id (str) – The unique identifier of the EMR cluster the notebook is attached to.

  • service_role (str) – The name or ARN of the IAM role that is used as the service role for Amazon EMR (the EMR role) for the notebook execution.

  • notebook_execution_name (str | None) – Optional name for the notebook execution.

  • notebook_params (str | None) – Input parameters in JSON format passed to the EMR notebook at runtime for execution.

  • tags (list | None) – Optional list of key value pair to associate with the notebook execution.

  • waiter_max_attempts (int | None | ArgNotSet) – Maximum number of tries before failing.

  • waiter_delay (int | None | ArgNotSet) – Number of seconds between polling the state of the notebook.

  • waiter_countdown (int) – Total amount of time the operator will wait for the notebook to stop. Defaults to 25 * 60 seconds. (Deprecated. Please use waiter_max_attempts.)

  • waiter_check_interval_seconds (int) – Number of seconds between polling the state of the notebook. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)

Param

notebook_instance_security_group_id: The unique identifier of the Amazon EC2 security group to associate with the EMR notebook for this notebook execution.

Param

master_instance_security_group_id: Optional unique ID of an EC2 security group to associate with the master instance of the EMR cluster for this notebook execution.

template_fields: Sequence[str] = ('editor_id', 'cluster_id', 'relative_path', 'service_role', 'notebook_execution_name',...[source]
execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrStopNotebookExecutionOperator(notebook_execution_id, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that stops a running EMR notebook execution.

See also

For more information on how to use this operator, take a look at the guide: Stop an EMR notebook execution

Parameters
  • notebook_execution_id (str) – The unique identifier of the notebook execution.

  • wait_for_completion (bool) – If True, the operator will wait for the notebook. to be in a STOPPED or FINISHED state. Defaults to False.

  • aws_conn_id (str) – aws connection to use.

  • waiter_max_attempts (int | None | ArgNotSet) – Maximum number of tries before failing.

  • waiter_delay (int | None | ArgNotSet) – Number of seconds between polling the state of the notebook.

  • waiter_countdown (int) – Total amount of time the operator will wait for the notebook to stop. Defaults to 25 * 60 seconds. (Deprecated. Please use waiter_max_attempts.)

  • waiter_check_interval_seconds (int) – Number of seconds between polling the state of the notebook. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)

template_fields: Sequence[str] = ('notebook_execution_id', 'waiter_delay', 'waiter_max_attempts')[source]
execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrEksCreateClusterOperator(*, virtual_cluster_name, eks_cluster_name, eks_namespace, virtual_cluster_id='', aws_conn_id='aws_default', tags=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that creates EMR on EKS virtual clusters.

See also

For more information on how to use this operator, take a look at the guide: Create an Amazon EMR EKS virtual cluster

Parameters
  • virtual_cluster_name (str) – The name of the EMR EKS virtual cluster to create.

  • eks_cluster_name (str) – The EKS cluster used by the EMR virtual cluster.

  • eks_namespace (str) – namespace used by the EKS cluster.

  • virtual_cluster_id (str) – The EMR on EKS virtual cluster id.

  • aws_conn_id (str) – The Airflow connection used for AWS credentials.

  • tags (dict | None) – The tags assigned to created cluster. Defaults to None

template_fields: Sequence[str] = ('virtual_cluster_name', 'eks_cluster_name', 'eks_namespace')[source]
ui_color = '#f9c915'[source]
hook()[source]

Create and return an EmrContainerHook.

execute(context)[source]

Create EMR on EKS virtual Cluster

class airflow.providers.amazon.aws.operators.emr.EmrContainerOperator(*, name, virtual_cluster_id, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, aws_conn_id='aws_default', wait_for_completion=True, poll_interval=30, max_tries=None, tags=None, max_polling_attempts=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that submits jobs to EMR on EKS virtual clusters.

See also

For more information on how to use this operator, take a look at the guide: Submit a job to an Amazon EMR virtual cluster

Parameters
  • name (str) – The name of the job run.

  • virtual_cluster_id (str) – The EMR on EKS virtual cluster ID

  • execution_role_arn (str) – The IAM role ARN associated with the job run.

  • release_label (str) – The Amazon EMR release version to use for the job run.

  • job_driver (dict) – Job configuration details, e.g. the Spark job parameters.

  • configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.

  • client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started. If no token is provided, a UUIDv4 token will be generated for you.

  • aws_conn_id (str) – The Airflow connection used for AWS credentials.

  • wait_for_completion (bool) – Whether or not to wait in the operator for the job to complete.

  • poll_interval (int) – Time (in seconds) to wait between two consecutive calls to check query status on EMR

  • max_tries (int | None) – Deprecated - use max_polling_attempts instead.

  • max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.

  • tags (dict | None) – The tags assigned to job runs. Defaults to None

template_fields: Sequence[str] = ('name', 'virtual_cluster_id', 'execution_role_arn', 'release_label', 'job_driver',...[source]
ui_color = '#f9c915'[source]
hook()[source]

Create and return an EmrContainerHook.

execute(context)[source]

Run job on EMR Containers

on_kill()[source]

Cancel the submitted job run

class airflow.providers.amazon.aws.operators.emr.EmrCreateJobFlowOperator(*, aws_conn_id='aws_default', emr_conn_id='emr_default', job_flow_overrides=None, region_name=None, wait_for_completion=False, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, waiter_countdown=None, waiter_check_interval_seconds=60, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that override the config from the connection.

See also

For more information on how to use this operator, take a look at the guide: Create an EMR job flow

Parameters
  • aws_conn_id (str) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node)

  • emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. Use to receive an initial Amazon EMR cluster configuration: boto3.client('emr').run_job_flow request body. If this is None or empty or the connection does not exist, then an empty initial configuration is used.

  • job_flow_overrides (str | dict[str, Any] | None) – boto3 style arguments or reference to an arguments file (must be ‘.json’) to override specific emr_conn_id extra parameters. (templated)

  • region_name (str | None) – Region named passed to EmrHook

  • wait_for_completion (bool) – Whether to finish task immediately after creation (False) or wait for jobflow completion (True)

  • waiter_max_attempts (int | None | ArgNotSet) – Maximum number of tries before failing.

  • waiter_delay (int | None | ArgNotSet) – Number of seconds between polling the state of the notebook.

  • waiter_countdown (int | None) – Max. seconds to wait for jobflow completion (only in combination with wait_for_completion=True, None = no limit) (Deprecated. Please use waiter_max_attempts.)

  • waiter_check_interval_seconds (int) – Number of seconds between polling the jobflow state. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)

template_fields: Sequence[str] = ('job_flow_overrides', 'waiter_delay', 'waiter_max_attempts')[source]
template_ext: Sequence[str] = ('.json',)[source]
template_fields_renderers[source]
ui_color = '#f9c915'[source]
execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

on_kill()[source]

Terminate the EMR cluster (job flow). If TerminationProtected=True on the cluster, termination will be unsuccessful.

class airflow.providers.amazon.aws.operators.emr.EmrModifyClusterOperator(*, cluster_id, step_concurrency_level, aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.models.BaseOperator

An operator that modifies an existing EMR cluster.

See also

For more information on how to use this operator, take a look at the guide: Modify Amazon EMR container

Parameters
  • cluster_id (str) – cluster identifier

  • step_concurrency_level (int) – Concurrency of the cluster

  • aws_conn_id (str) – aws connection to uses

  • do_xcom_push – if True, cluster_id is pushed to XCom with key cluster_id.

template_fields: Sequence[str] = ('cluster_id', 'step_concurrency_level')[source]
template_ext: Sequence[str] = ()[source]
ui_color = '#f9c915'[source]
execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrTerminateJobFlowOperator(*, job_flow_id, aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to terminate EMR JobFlows.

See also

For more information on how to use this operator, take a look at the guide: Terminate an EMR job flow

Parameters
  • job_flow_id (str) – id of the JobFlow to terminate. (templated)

  • aws_conn_id (str) – aws connection to uses

template_fields: Sequence[str] = ('job_flow_id',)[source]
template_ext: Sequence[str] = ()[source]
ui_color = '#f9c915'[source]
execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrServerlessCreateApplicationOperator(release_label, job_type, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to create Serverless EMR Application

See also

For more information on how to use this operator, take a look at the guide: Create an EMR Serverless Application

Parameters
  • release_label (str) – The EMR release version associated with the application.

  • job_type (str) – The type of application you want to start, such as Spark or Hive.

  • wait_for_completion (bool) – If true, wait for the Application to start before returning. Default to True. If set to False, waiter_countdown and waiter_check_interval_seconds will only be applied when waiting for the application to be in the CREATED state.

  • client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.

  • config (dict | None) – Optional dictionary for arbitrary parameters to the boto API create_application call.

  • aws_conn_id (str) – AWS connection to use

  • waiter_countdown (int) – Total amount of time, in seconds, the operator will wait for the application to start. Defaults to 25 minutes.

  • waiter_check_interval_seconds (int) – Number of seconds between polling the state of the application. Defaults to 60 seconds.

hook()[source]

Create and return an EmrServerlessHook.

execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrServerlessStartJobOperator(application_id, execution_role_arn, job_driver, configuration_overrides, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', name=None, waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to start EMR Serverless job.

See also

For more information on how to use this operator, take a look at the guide: Start an EMR Serverless Job

Parameters
  • application_id (str) – ID of the EMR Serverless application to start.

  • execution_role_arn (str) – ARN of role to perform action.

  • job_driver (dict) – Driver that the job runs on.

  • configuration_overrides (dict | None) – Configuration specifications to override existing configurations.

  • client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.

  • config (dict | None) – Optional dictionary for arbitrary parameters to the boto API start_job_run call.

  • wait_for_completion (bool) – If true, waits for the job to start before returning. Defaults to True. If set to False, waiter_countdown and waiter_check_interval_seconds will only be applied when waiting for the application be to in the STARTED state.

  • aws_conn_id (str) – AWS connection to use.

  • name (str | None) – Name for the EMR Serverless job. If not provided, a default name will be assigned.

  • waiter_countdown (int) – Total amount of time, in seconds, the operator will wait for the job finish. Defaults to 25 minutes.

  • waiter_check_interval_seconds (int) – Number of seconds between polling the state of the job. Defaults to 60 seconds.

template_fields: Sequence[str] = ('application_id', 'execution_role_arn', 'job_driver', 'configuration_overrides')[source]
hook()[source]

Create and return an EmrServerlessHook.

execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.emr.EmrServerlessDeleteApplicationOperator(application_id, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]

Bases: airflow.models.BaseOperator

Operator to delete EMR Serverless application

See also

For more information on how to use this operator, take a look at the guide: Delete an EMR Serverless Application

Parameters
  • application_id (str) – ID of the EMR Serverless application to delete.

  • wait_for_completion (bool) – If true, wait for the Application to start before returning. Default to True

  • aws_conn_id (str) – AWS connection to use

  • waiter_countdown (int) – Total amount of time, in seconds, the operator will wait for the application be deleted. Defaults to 25 minutes.

  • waiter_check_interval_seconds (int) – Number of seconds between polling the state of the application. Defaults to 60 seconds.

template_fields: Sequence[str] = ('application_id',)[source]
hook()[source]

Create and return an EmrServerlessHook.

execute(context)[source]

This is the main method to derive when creating an operator. Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

Was this entry helpful?