airflow.providers.amazon.aws.operators.emr
Module Contents¶
Classes¶
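EmrAddStepsOperator | An operator that adds steps to an existing EMR job_flow.
EmrStartNotebookExecutionOperator | An operator that starts an EMR notebook execution.
EmrStopNotebookExecutionOperator | An operator that stops a running EMR notebook execution.
EmrEksCreateClusterOperator | An operator that creates EMR on EKS virtual clusters.
EmrContainerOperator | An operator that submits jobs to EMR on EKS virtual clusters.
EmrCreateJobFlowOperator | Creates an EMR JobFlow, reading the config from the EMR connection.
EmrModifyClusterOperator | An operator that modifies an existing EMR cluster.
EmrTerminateJobFlowOperator | Operator to terminate EMR JobFlows.
EmrServerlessCreateApplicationOperator | Operator to create Serverless EMR Application.
EmrServerlessStartJobOperator | Operator to start EMR Serverless job.
EmrServerlessStopApplicationOperator | Operator to stop an EMR Serverless application.
EmrServerlessDeleteApplicationOperator | Operator to delete EMR Serverless application.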
- class airflow.providers.amazon.aws.operators.emr.EmrAddStepsOperator(*, job_flow_id=None, job_flow_name=None, cluster_states=None, aws_conn_id='aws_default', steps=None, wait_for_completion=False, waiter_delay=30, waiter_max_attempts=60, execution_role_arn=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
See also
For more information on how to use this operator, take a look at the guide: Add Steps to an EMR job flow
- Parameters
job_flow_id (str | None) – id of the JobFlow to add steps to. (templated)
job_flow_name (str | None) – Name of the JobFlow to add steps to. Use as an alternative to passing job_flow_id. The operator searches for the id of the JobFlow with a matching name in one of the states listed in cluster_states. Exactly one such cluster must exist, otherwise the task fails. (templated)
cluster_states (list[str] | None) – Acceptable cluster states when searching for JobFlow id by job_flow_name. (templated)
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
steps (list[dict] | str | None) – boto3 style steps or reference to a steps file (must be ‘.json’) to be added to the jobflow. (templated)
wait_for_completion (bool) – If True, the operator will wait for all the steps to be completed. (default: False)
execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.
do_xcom_push – If True, job_flow_id is pushed to XCom with key job_flow_id.
deferrable (bool) – If True, the operator will wait asynchronously for the job to complete. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)
- template_fields: Sequence[str] = ('job_flow_id', 'job_flow_name', 'cluster_states', 'steps', 'execution_role_arn')[source]¶
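As a rough illustration of the steps parameter, the sketch below builds one boto3-style step dictionary and passes it to the operator. The helper names (build_spark_step, make_add_steps_task), the S3 path, and the task id are hypothetical. The Airflow import is deferred into a function so the step builder can be used without Airflow installed.

```python
from __future__ import annotations


def build_spark_step(name: str, script_uri: str) -> dict:
    """Return one boto3-style EMR step that runs spark-submit on the cluster."""
    return {
        "Name": name,
        # CONTINUE leaves the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def make_add_steps_task(job_flow_id: str):
    """Create the operator; requires the Amazon provider package."""
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

    return EmrAddStepsOperator(
        task_id="add_steps",  # hypothetical task id
        job_flow_id=job_flow_id,
        steps=[build_spark_step("example_step", "s3://my-bucket/job.py")],
        wait_for_completion=True,
    )
```

Because job_flow_id and steps are template fields, both can also be supplied as Jinja expressions rather than literal values.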
- class airflow.providers.amazon.aws.operators.emr.EmrStartNotebookExecutionOperator(editor_id, relative_path, cluster_id, service_role, notebook_execution_name=None, notebook_params=None, notebook_instance_security_group_id=None, master_instance_security_group_id=None, tags=None, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=None, waiter_delay=None, waiter_countdown=None, waiter_check_interval_seconds=None, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that starts an EMR notebook execution.
See also
For more information on how to use this operator, take a look at the guide: Start an EMR notebook execution
- Parameters
editor_id (str) – The unique identifier of the EMR notebook to use for notebook execution.
relative_path (str) – The path and file name of the notebook file for this execution, relative to the path specified for the EMR notebook.
cluster_id (str) – The unique identifier of the EMR cluster the notebook is attached to.
service_role (str) – The name or ARN of the IAM role that is used as the service role for Amazon EMR (the EMR role) for the notebook execution.
notebook_execution_name (str | None) – Optional name for the notebook execution.
notebook_params (str | None) – Input parameters in JSON format passed to the EMR notebook at runtime for execution.
notebook_instance_security_group_id (str | None) – The unique identifier of the Amazon EC2 security group to associate with the EMR notebook for this notebook execution.
master_instance_security_group_id (str | None) – Optional unique ID of an EC2 security group to associate with the master instance of the EMR cluster for this notebook execution.
tags (list | None) – Optional list of key-value pairs to associate with the notebook execution.
waiter_max_attempts (int | None) – Maximum number of tries before failing.
waiter_delay (int | None) – Number of seconds between polling the state of the notebook.
waiter_countdown (int | None) – Total amount of time the operator will wait for the notebook to stop. Defaults to 25 * 60 seconds. (Deprecated. Please use waiter_max_attempts.)
waiter_check_interval_seconds (int | None) – Number of seconds between polling the state of the notebook. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)
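Since notebook_params is a JSON string rather than a dictionary, it can be convenient to serialize a Python dict before passing it in. A minimal sketch, assuming hypothetical identifiers, paths, and helper names throughout:

```python
from __future__ import annotations

import json


def build_notebook_params(params: dict) -> str:
    """Serialize notebook input parameters to the JSON string the operator expects."""
    return json.dumps(params, sort_keys=True)


def make_start_notebook_task(editor_id: str, cluster_id: str, service_role: str):
    """Create the operator; requires the Amazon provider package."""
    from airflow.providers.amazon.aws.operators.emr import (
        EmrStartNotebookExecutionOperator,
    )

    return EmrStartNotebookExecutionOperator(
        task_id="start_notebook",  # hypothetical task id
        editor_id=editor_id,
        relative_path="reports/daily.ipynb",  # hypothetical notebook path
        cluster_id=cluster_id,
        service_role=service_role,
        notebook_params=build_notebook_params({"run_date": "2024-01-01"}),
        wait_for_completion=True,
    )
```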
- class airflow.providers.amazon.aws.operators.emr.EmrStopNotebookExecutionOperator(notebook_execution_id, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=None, waiter_delay=None, waiter_countdown=None, waiter_check_interval_seconds=None, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that stops a running EMR notebook execution.
See also
For more information on how to use this operator, take a look at the guide: Stop an EMR notebook execution
- Parameters
notebook_execution_id (str) – The unique identifier of the notebook execution.
wait_for_completion (bool) – If True, the operator will wait for the notebook to be in a STOPPED or FINISHED state. Defaults to False.
aws_conn_id (str | None) – aws connection to use. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
waiter_max_attempts (int | None) – Maximum number of tries before failing.
waiter_delay (int | None) – Number of seconds between polling the state of the notebook.
waiter_countdown (int | None) – Total amount of time the operator will wait for the notebook to stop. Defaults to 25 * 60 seconds. (Deprecated. Please use waiter_max_attempts.)
waiter_check_interval_seconds (int | None) – Number of seconds between polling the state of the notebook. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)
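The deprecated pair (waiter_countdown, waiter_check_interval_seconds) describes a total time budget and a polling interval, while the replacements describe a polling interval and an attempt count. The sketch below shows one way to translate existing values when migrating, assuming the budget is simply divided by the interval. The helper name is hypothetical, and this is not necessarily how the provider converts the values internally.

```python
def to_waiter_args(countdown_seconds: int, check_interval_seconds: int) -> dict:
    """Translate a time budget and polling interval into waiter_delay / waiter_max_attempts."""
    if check_interval_seconds <= 0:
        raise ValueError("check_interval_seconds must be positive")
    # Always allow at least one poll, even if the budget is shorter than the interval.
    attempts = max(1, countdown_seconds // check_interval_seconds)
    return {
        "waiter_delay": check_interval_seconds,
        "waiter_max_attempts": attempts,
    }
```

With the documented legacy defaults (25 * 60 seconds, polled every 60 seconds), this yields a delay of 60 seconds and 25 attempts.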
- class airflow.providers.amazon.aws.operators.emr.EmrEksCreateClusterOperator(*, virtual_cluster_name, eks_cluster_name, eks_namespace, virtual_cluster_id='', aws_conn_id='aws_default', tags=None, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that creates EMR on EKS virtual clusters.
See also
For more information on how to use this operator, take a look at the guide: Create an Amazon EMR EKS virtual cluster
- Parameters
virtual_cluster_name (str) – The name of the EMR EKS virtual cluster to create.
eks_cluster_name (str) – The EKS cluster used by the EMR virtual cluster.
eks_namespace (str) – Namespace used by the EKS cluster.
virtual_cluster_id (str) – The EMR on EKS virtual cluster id.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials.
tags (dict | None) – The tags assigned to created cluster. Defaults to None
- class airflow.providers.amazon.aws.operators.emr.EmrContainerOperator(*, name, virtual_cluster_id, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, aws_conn_id='aws_default', wait_for_completion=True, poll_interval=30, max_tries=None, tags=None, max_polling_attempts=None, job_retry_max_attempts=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that submits jobs to EMR on EKS virtual clusters.
See also
For more information on how to use this operator, take a look at the guide: Submit a job to an Amazon EMR virtual cluster
- Parameters
name (str) – The name of the job run.
virtual_cluster_id (str) – The EMR on EKS virtual cluster ID
execution_role_arn (str) – The IAM role ARN associated with the job run.
release_label (str) – The Amazon EMR release version to use for the job run.
job_driver (dict) – Job configuration details, e.g. the Spark job parameters.
configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.
client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started. If no token is provided, a UUIDv4 token will be generated for you.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials.
wait_for_completion (bool) – Whether or not to wait in the operator for the job to complete.
poll_interval (int) – Time (in seconds) to wait between two consecutive calls to check job status on EMR.
max_tries (int | None) – Deprecated - use max_polling_attempts instead.
max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.
job_retry_max_attempts (int | None) – Maximum number of times to retry when the EMR job fails. Defaults to None, which disables retries.
tags (dict | None) – The tags assigned to job runs. Defaults to None
deferrable (bool) – Run operator in the deferrable mode.
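To make the shape of job_driver and configuration_overrides more concrete, the sketch below builds both dictionaries using the key names of the EMR on EKS StartJobRun request and hands them to the operator. The helper names, S3 path, log group, release label value, and task id are hypothetical.

```python
from __future__ import annotations


def build_job_driver(entry_point: str, spark_submit_parameters: str) -> dict:
    """Return a Spark job driver in the shape expected by EMR on EKS."""
    return {
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": spark_submit_parameters,
        }
    }


def build_configuration_overrides(log_group: str, stream_prefix: str) -> dict:
    """Return monitoring overrides that send job logs to CloudWatch."""
    return {
        "monitoringConfiguration": {
            "cloudWatchMonitoringConfiguration": {
                "logGroupName": log_group,
                "logStreamNamePrefix": stream_prefix,
            }
        }
    }


def make_container_task(virtual_cluster_id: str, execution_role_arn: str):
    """Create the operator; requires the Amazon provider package."""
    from airflow.providers.amazon.aws.operators.emr import EmrContainerOperator

    return EmrContainerOperator(
        task_id="run_emr_eks_job",  # hypothetical task id
        name="example-job",
        virtual_cluster_id=virtual_cluster_id,
        execution_role_arn=execution_role_arn,
        release_label="emr-6.3.0-latest",  # hypothetical release label
        job_driver=build_job_driver(
            "s3://my-bucket/job.py", "--conf spark.executor.instances=2"
        ),
        configuration_overrides=build_configuration_overrides(
            "/emr-eks/example", "example-job"
        ),
    )
```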
- class airflow.providers.amazon.aws.operators.emr.EmrCreateJobFlowOperator(*, aws_conn_id='aws_default', emr_conn_id='emr_default', job_flow_overrides=None, region_name=None, wait_for_completion=False, waiter_max_attempts=None, waiter_delay=None, waiter_countdown=None, waiter_check_interval_seconds=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection.
A dictionary of JobFlow overrides can be passed that override the config from the connection.
See also
For more information on how to use this operator, take a look at the guide: Create an EMR job flow
- Parameters
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. Use to receive an initial Amazon EMR cluster configuration: the boto3.client('emr').run_job_flow request body. If this is None or empty or the connection does not exist, then an empty initial configuration is used.
job_flow_overrides (str | dict[str, Any] | None) – boto3 style arguments or reference to an arguments file (must be ‘.json’) to override specific emr_conn_id extra parameters. (templated)
region_name (str | None) – Region name passed to EmrHook.
wait_for_completion (bool) – Whether to finish task immediately after creation (False) or wait for jobflow completion (True)
waiter_max_attempts (int | None) – Maximum number of tries before failing.
waiter_delay (int | None) – Number of seconds between polling the state of the jobflow.
waiter_countdown (int | None) – Max. seconds to wait for jobflow completion (only in combination with wait_for_completion=True, None = no limit) (Deprecated. Please use waiter_max_attempts.)
waiter_check_interval_seconds (int | None) – Number of seconds between polling the jobflow state. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)
deferrable (bool) – If True, the operator will wait asynchronously for the jobflow to complete. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)
- template_fields: Sequence[str] = ('job_flow_overrides', 'waiter_delay', 'waiter_max_attempts')[source]¶
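The relationship between the connection's initial configuration and job_flow_overrides can be pictured as a dictionary merge. The sketch below assumes a shallow, top-level merge in which each override key replaces the whole corresponding value from the connection. That merge depth is an assumption rather than something this page states, and apply_overrides is a hypothetical helper, not a provider function.

```python
from __future__ import annotations


def apply_overrides(connection_config: dict, job_flow_overrides: dict) -> dict:
    """Combine the connection's run_job_flow config with task-level overrides.

    Assumption: the merge is shallow, so an overridden top-level key
    (for example "Instances") replaces the connection's value entirely.
    """
    merged = dict(connection_config)
    merged.update(job_flow_overrides)
    return merged


# Hypothetical configuration stored on the EMR connection.
connection_config = {
    "Name": "default_job_flow_name",
    "ReleaseLabel": "emr-6.9.0",
    "Instances": {"KeepJobFlowAliveWhenNoSteps": False},
}

# Hypothetical task-level overrides passed as job_flow_overrides.
overrides = {
    "Name": "nightly-etl",
    "Instances": {"InstanceGroups": []},
}

final_config = apply_overrides(connection_config, overrides)
```

Under this assumption, any nested settings that must survive an override have to be repeated in job_flow_overrides.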
- class airflow.providers.amazon.aws.operators.emr.EmrModifyClusterOperator(*, cluster_id, step_concurrency_level, aws_conn_id='aws_default', **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that modifies an existing EMR cluster.
See also
For more information on how to use this operator, take a look at the guide: Modify Amazon EMR container
- Parameters
- class airflow.providers.amazon.aws.operators.emr.EmrTerminateJobFlowOperator(*, job_flow_id, aws_conn_id='aws_default', waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
See also
For more information on how to use this operator, take a look at the guide: Terminate an EMR job flow
- Parameters
job_flow_id (str) – id of the JobFlow to terminate. (templated)
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
waiter_delay (int) – Time (in seconds) to wait between two consecutive calls to check JobFlow status.
waiter_max_attempts (int) – The maximum number of times to poll for JobFlow status.
deferrable (bool) – If True, the operator will wait asynchronously for the JobFlow to terminate. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessCreateApplicationOperator(release_label, job_type, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=NOTSET, waiter_check_interval_seconds=NOTSET, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to create Serverless EMR Application.
See also
For more information on how to use this operator, take a look at the guide: Create an EMR Serverless Application
- Parameters
release_label (str) – The EMR release version associated with the application.
job_type (str) – The type of application you want to start, such as Spark or Hive.
wait_for_completion (bool) – If true, wait for the Application to start before returning. Defaults to True. If set to False, waiter_max_attempts and waiter_delay will only be applied when waiting for the application to be in the CREATED state.
client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.
config (dict | None) – Optional dictionary for arbitrary parameters to the boto API create_application call.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
waiter_countdown (int | airflow.utils.types.ArgNotSet) – (deprecated) Total amount of time, in seconds, the operator will wait for the application to start. Defaults to 25 minutes.
waiter_check_interval_seconds (int | airflow.utils.types.ArgNotSet) – (deprecated) Number of seconds between polling the state of the application. Defaults to 60 seconds.
waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the application.
deferrable (bool) – If True, the operator will wait asynchronously for application to be created. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)
waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. If not set, the waiter will use its default value.
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessStartJobOperator(application_id, execution_role_arn, job_driver, configuration_overrides=None, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', name=None, waiter_countdown=NOTSET, waiter_check_interval_seconds=NOTSET, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), enable_application_ui_links=False, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to start EMR Serverless job.
See also
For more information on how to use this operator, take a look at the guide: Start an EMR Serverless Job
- Parameters
application_id (str) – ID of the EMR Serverless application to start.
execution_role_arn (str) – ARN of role to perform action.
job_driver (dict) – Driver that the job runs on.
configuration_overrides (dict | None) – Configuration specifications to override existing configurations.
client_request_token (str) – The client idempotency token of the job run to start. Its value must be unique for each request.
config (dict | None) – Optional dictionary for arbitrary parameters to the boto API start_job_run call.
wait_for_completion (bool) – If true, waits for the job to start before returning. Defaults to True. If set to False, waiter_countdown and waiter_check_interval_seconds will only be applied when waiting for the application to be in the STARTED state.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
name (str | None) – Name for the EMR Serverless job. If not provided, a default name will be assigned.
waiter_countdown (int | airflow.utils.types.ArgNotSet) – (deprecated) Total amount of time, in seconds, the operator will wait for the job to finish. Defaults to 25 minutes.
waiter_check_interval_seconds (int | airflow.utils.types.ArgNotSet) – (deprecated) Number of seconds between polling the state of the job. Defaults to 60 seconds.
waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the job run.
deferrable (bool) – If True, the operator will wait asynchronously for the job to complete. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)
enable_application_ui_links (bool) – If True, the operator will generate one-time links to EMR Serverless application UIs. The generated links will allow any user with access to the DAG to see the Spark or Tez UI or Spark stdout logs. Defaults to False.
waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. If not set, the waiter will use its default value.
- template_fields: Sequence[str] = ('application_id', 'config', 'execution_role_arn', 'job_driver', 'configuration_overrides',...[source]¶
- execute(context, event=None)[source]¶
Derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
- is_monitoring_in_job_override(config_key, job_override)[source]¶
Check if monitoring is enabled for the job.
Note: This is not compatible with application defaults: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/default-configs.html
This is used to determine what extra links should be shown.
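The sketch below illustrates the shape of a Spark job_driver for EMR Serverless and the kind of check that is_monitoring_in_job_override describes. The monitoring_enabled function is a simplified, hypothetical stand-in and not the provider's implementation: it treats a monitoring section as enabled when the key is present and not explicitly disabled. The helper names and S3 paths are likewise assumptions.

```python
from __future__ import annotations


def build_serverless_job_driver(entry_point: str, arguments: list[str]) -> dict:
    """Return a Spark job driver in the shape expected by EMR Serverless."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "entryPointArguments": list(arguments),
        }
    }


def monitoring_enabled(config_key: str, job_override: dict | None) -> bool:
    """Simplified check: is config_key present (and not disabled) in the overrides?"""
    monitoring = (job_override or {}).get("monitoringConfiguration") or {}
    section = monitoring.get(config_key)
    if section is None:
        return False
    # Some monitoring sections carry an explicit "enabled" flag.
    return section.get("enabled", True) is True


# Hypothetical overrides that send logs to S3.
overrides = {
    "monitoringConfiguration": {
        "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/logs/"}
    }
}
```

As the note above says, a check like this only sees the per-job overrides, so monitoring that is configured through application defaults would not be detected.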
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessStopApplicationOperator(application_id, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=NOTSET, waiter_check_interval_seconds=NOTSET, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, force_stop=False, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to stop an EMR Serverless application.
See also
For more information on how to use this operator, take a look at the guide: Stop an EMR Serverless Application
- Parameters
application_id (str) – ID of the EMR Serverless application to stop.
wait_for_completion (bool) – If true, wait for the Application to stop before returning. Defaults to True.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
waiter_countdown (int | airflow.utils.types.ArgNotSet) – (deprecated) Total amount of time, in seconds, the operator will wait for the application to be stopped. Defaults to 5 minutes.
waiter_check_interval_seconds (int | airflow.utils.types.ArgNotSet) – (deprecated) Number of seconds between polling the state of the application. Defaults to 60 seconds.
force_stop (bool) – If set to True, any job for that app that is not in a terminal state will be cancelled. Otherwise, trying to stop an app with running jobs will return an error. If you want to wait for the jobs to finish gracefully, use airflow.providers.amazon.aws.sensors.emr.EmrServerlessJobSensor.
waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the application. Default is 60 seconds.
deferrable (bool) – If True, the operator will wait asynchronously for the application to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)
waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. Default is 25.
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessDeleteApplicationOperator(application_id, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=NOTSET, waiter_check_interval_seconds=NOTSET, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, force_stop=False, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶
Bases:
EmrServerlessStopApplicationOperator
Operator to delete EMR Serverless application.
See also
For more information on how to use this operator, take a look at the guide: Delete an EMR Serverless Application
- Parameters
application_id (str) – ID of the EMR Serverless application to delete.
wait_for_completion (bool) – If true, wait for the Application to be deleted before returning. Defaults to True. Note that this operator will always wait for the application to be STOPPED first.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).
waiter_countdown (int | airflow.utils.types.ArgNotSet) – (deprecated) Total amount of time, in seconds, the operator will wait for each of the two steps: first for the application to be stopped, then for it to be deleted. Defaults to 25 minutes.
waiter_check_interval_seconds (int | airflow.utils.types.ArgNotSet) – (deprecated) Number of seconds between polling the state of the application. Defaults to 60 seconds.
waiter_delay (int | airflow.utils.types.ArgNotSet) – Number of seconds between polling the state of the application. Defaults to 60 seconds.
deferrable (bool) – If True, the operator will wait asynchronously for application to be deleted. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False, but can be overridden in config file by setting default_deferrable to True)
force_stop (bool) – If set to True, any job for that app that is not in a terminal state will be cancelled. Otherwise, trying to delete an app with running jobs will return an error. If you want to wait for the jobs to finish gracefully, use airflow.providers.amazon.aws.sensors.emr.EmrServerlessJobSensor.
waiter_max_attempts (int | airflow.utils.types.ArgNotSet) – Number of times the waiter should poll the application to check the state. Defaults to 25.
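A small sketch of how force_stop might be chosen when tearing down an application: the hypothetical helper below builds the keyword arguments separately from the operator so the choice is explicit. The helper names and task id are assumptions.

```python
from __future__ import annotations


def build_delete_kwargs(application_id: str, cancel_running_jobs: bool) -> dict:
    """Return keyword arguments for deleting an EMR Serverless application."""
    return {
        "application_id": application_id,
        # True cancels jobs that are not in a terminal state;
        # False lets the task fail if jobs are still running.
        "force_stop": cancel_running_jobs,
        "wait_for_completion": True,
    }


def make_delete_task(application_id: str):
    """Create the operator; requires the Amazon provider package."""
    from airflow.providers.amazon.aws.operators.emr import (
        EmrServerlessDeleteApplicationOperator,
    )

    return EmrServerlessDeleteApplicationOperator(
        task_id="delete_app",  # hypothetical task id
        **build_delete_kwargs(application_id, cancel_running_jobs=True),
    )
```

If jobs should be allowed to finish first, an EmrServerlessJobSensor task can be placed upstream and cancel_running_jobs set to False.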