airflow.providers.amazon.aws.operators.emr
¶
Module Contents¶
Classes¶
An operator that adds steps to an existing EMR job_flow. |
|
An operator that starts an EMR notebook execution. |
|
An operator that stops a running EMR notebook execution. |
|
An operator that creates EMR on EKS virtual clusters. |
|
An operator that submits jobs to EMR on EKS virtual clusters. |
|
Creates an EMR JobFlow, reading the config from the EMR connection. |
|
An operator that modifies an existing EMR cluster. |
|
Operator to terminate EMR JobFlows. |
|
Operator to create Serverless EMR Application |
|
Operator to start EMR Serverless job. |
|
Operator to delete EMR Serverless application |
- class airflow.providers.amazon.aws.operators.emr.EmrAddStepsOperator(*, job_flow_id=None, job_flow_name=None, cluster_states=None, aws_conn_id='aws_default', steps=None, wait_for_completion=False, waiter_delay=None, waiter_max_attempts=None, execution_role_arn=None, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
See also
For more information on how to use this operator, take a look at the guide: Add Steps to an EMR job flow
- Parameters
job_flow_id (str | None) – id of the JobFlow to add steps to. (templated)
job_flow_name (str | None) – name of the JobFlow to add steps to. Use as an alternative to passing job_flow_id. will search for id of JobFlow with matching name in one of the states in param cluster_states. Exactly one cluster like this should exist or will fail. (templated)
cluster_states (list[str] | None) – Acceptable cluster states when searching for JobFlow id by job_flow_name. (templated)
aws_conn_id (str) – aws connection to uses
steps (list[dict] | str | None) – boto3 style steps or reference to a steps file (must be ‘.json’) to be added to the jobflow. (templated)
wait_for_completion (bool) – If True, the operator will wait for all the steps to be completed.
execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.
do_xcom_push – if True, job_flow_id is pushed to XCom with key job_flow_id.
- class airflow.providers.amazon.aws.operators.emr.EmrStartNotebookExecutionOperator(editor_id, relative_path, cluster_id, service_role, notebook_execution_name=None, notebook_params=None, notebook_instance_security_group_id=None, master_instance_security_group_id=None, tags=None, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that starts an EMR notebook execution.
See also
For more information on how to use this operator, take a look at the guide: Start an EMR notebook execution
- Parameters
editor_id (str) – The unique identifier of the EMR notebook to use for notebook execution.
relative_path (str) – The path and file name of the notebook file for this execution, relative to the path specified for the EMR notebook.
cluster_id (str) – The unique identifier of the EMR cluster the notebook is attached to.
service_role (str) – The name or ARN of the IAM role that is used as the service role for Amazon EMR (the EMR role) for the notebook execution.
notebook_execution_name (str | None) – Optional name for the notebook execution.
notebook_params (str | None) – Input parameters in JSON format passed to the EMR notebook at runtime for execution.
tags (list | None) – Optional list of key value pair to associate with the notebook execution.
waiter_max_attempts (int | None | ArgNotSet) – Maximum number of tries before failing.
waiter_delay (int | None | ArgNotSet) – Number of seconds between polling the state of the notebook.
waiter_countdown (int) – Total amount of time the operator will wait for the notebook to stop. Defaults to 25 * 60 seconds. (Deprecated. Please use waiter_max_attempts.)
waiter_check_interval_seconds (int) – Number of seconds between polling the state of the notebook. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)
- Param
notebook_instance_security_group_id: The unique identifier of the Amazon EC2 security group to associate with the EMR notebook for this notebook execution.
- Param
master_instance_security_group_id: Optional unique ID of an EC2 security group to associate with the master instance of the EMR cluster for this notebook execution.
- class airflow.providers.amazon.aws.operators.emr.EmrStopNotebookExecutionOperator(notebook_execution_id, wait_for_completion=False, aws_conn_id='aws_default', waiter_max_attempts=NOTSET, waiter_delay=NOTSET, waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that stops a running EMR notebook execution.
See also
For more information on how to use this operator, take a look at the guide: Stop an EMR notebook execution
- Parameters
notebook_execution_id (str) – The unique identifier of the notebook execution.
wait_for_completion (bool) – If True, the operator will wait for the notebook. to be in a STOPPED or FINISHED state. Defaults to False.
aws_conn_id (str) – aws connection to use.
waiter_max_attempts (int | None | ArgNotSet) – Maximum number of tries before failing.
waiter_delay (int | None | ArgNotSet) – Number of seconds between polling the state of the notebook.
waiter_countdown (int) – Total amount of time the operator will wait for the notebook to stop. Defaults to 25 * 60 seconds. (Deprecated. Please use waiter_max_attempts.)
waiter_check_interval_seconds (int) – Number of seconds between polling the state of the notebook. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)
- class airflow.providers.amazon.aws.operators.emr.EmrEksCreateClusterOperator(*, virtual_cluster_name, eks_cluster_name, eks_namespace, virtual_cluster_id='', aws_conn_id='aws_default', tags=None, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that creates EMR on EKS virtual clusters.
See also
For more information on how to use this operator, take a look at the guide: Create an Amazon EMR EKS virtual cluster
- Parameters
virtual_cluster_name (str) – The name of the EMR EKS virtual cluster to create.
eks_cluster_name (str) – The EKS cluster used by the EMR virtual cluster.
eks_namespace (str) – namespace used by the EKS cluster.
virtual_cluster_id (str) – The EMR on EKS virtual cluster id.
aws_conn_id (str) – The Airflow connection used for AWS credentials.
tags (dict | None) – The tags assigned to created cluster. Defaults to None
- class airflow.providers.amazon.aws.operators.emr.EmrContainerOperator(*, name, virtual_cluster_id, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, aws_conn_id='aws_default', wait_for_completion=True, poll_interval=30, max_tries=None, tags=None, max_polling_attempts=None, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that submits jobs to EMR on EKS virtual clusters.
See also
For more information on how to use this operator, take a look at the guide: Submit a job to an Amazon EMR virtual cluster
- Parameters
name (str) – The name of the job run.
virtual_cluster_id (str) – The EMR on EKS virtual cluster ID
execution_role_arn (str) – The IAM role ARN associated with the job run.
release_label (str) – The Amazon EMR release version to use for the job run.
job_driver (dict) – Job configuration details, e.g. the Spark job parameters.
configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.
client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started. If no token is provided, a UUIDv4 token will be generated for you.
aws_conn_id (str) – The Airflow connection used for AWS credentials.
wait_for_completion (bool) – Whether or not to wait in the operator for the job to complete.
poll_interval (int) – Time (in seconds) to wait between two consecutive calls to check query status on EMR
max_tries (int | None) – Deprecated - use max_polling_attempts instead.
max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.
tags (dict | None) – The tags assigned to job runs. Defaults to None
- class airflow.providers.amazon.aws.operators.emr.EmrCreateJobFlowOperator(*, aws_conn_id='aws_default', emr_conn_id='emr_default', job_flow_overrides=None, region_name=None, wait_for_completion=False, waiter_max_attempts=NOTSET, waiter_delay=NOTSET, waiter_countdown=None, waiter_check_interval_seconds=60, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that override the config from the connection.
See also
For more information on how to use this operator, take a look at the guide: Create an EMR job flow
- Parameters
aws_conn_id (str) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node)
emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. Use to receive an initial Amazon EMR cluster configuration:
boto3.client('emr').run_job_flow
request body. If this is None or empty or the connection does not exist, then an empty initial configuration is used.job_flow_overrides (str | dict[str, Any] | None) – boto3 style arguments or reference to an arguments file (must be ‘.json’) to override specific
emr_conn_id
extra parameters. (templated)region_name (str | None) – Region named passed to EmrHook
wait_for_completion (bool) – Whether to finish task immediately after creation (False) or wait for jobflow completion (True)
waiter_max_attempts (int | None | ArgNotSet) – Maximum number of tries before failing.
waiter_delay (int | None | ArgNotSet) – Number of seconds between polling the state of the notebook.
waiter_countdown (int | None) – Max. seconds to wait for jobflow completion (only in combination with wait_for_completion=True, None = no limit) (Deprecated. Please use waiter_max_attempts.)
waiter_check_interval_seconds (int) – Number of seconds between polling the jobflow state. Defaults to 60 seconds. (Deprecated. Please use waiter_delay.)
- template_fields: Sequence[str] = ('job_flow_overrides', 'waiter_delay', 'waiter_max_attempts')[source]¶
- class airflow.providers.amazon.aws.operators.emr.EmrModifyClusterOperator(*, cluster_id, step_concurrency_level, aws_conn_id='aws_default', **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
An operator that modifies an existing EMR cluster.
See also
For more information on how to use this operator, take a look at the guide: Modify Amazon EMR container
- Parameters
- class airflow.providers.amazon.aws.operators.emr.EmrTerminateJobFlowOperator(*, job_flow_id, aws_conn_id='aws_default', **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
See also
For more information on how to use this operator, take a look at the guide: Terminate an EMR job flow
- Parameters
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessCreateApplicationOperator(release_label, job_type, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to create Serverless EMR Application
See also
For more information on how to use this operator, take a look at the guide: Create an EMR Serverless Application
- Parameters
release_label (str) – The EMR release version associated with the application.
job_type (str) – The type of application you want to start, such as Spark or Hive.
wait_for_completion (bool) – If true, wait for the Application to start before returning. Default to True. If set to False,
waiter_countdown
andwaiter_check_interval_seconds
will only be applied when waiting for the application to be in theCREATED
state.client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.
config (dict | None) – Optional dictionary for arbitrary parameters to the boto API create_application call.
aws_conn_id (str) – AWS connection to use
waiter_countdown (int) – Total amount of time, in seconds, the operator will wait for the application to start. Defaults to 25 minutes.
waiter_check_interval_seconds (int) – Number of seconds between polling the state of the application. Defaults to 60 seconds.
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessStartJobOperator(application_id, execution_role_arn, job_driver, configuration_overrides, client_request_token='', config=None, wait_for_completion=True, aws_conn_id='aws_default', name=None, waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to start EMR Serverless job.
See also
For more information on how to use this operator, take a look at the guide: Start an EMR Serverless Job
- Parameters
application_id (str) – ID of the EMR Serverless application to start.
execution_role_arn (str) – ARN of role to perform action.
job_driver (dict) – Driver that the job runs on.
configuration_overrides (dict | None) – Configuration specifications to override existing configurations.
client_request_token (str) – The client idempotency token of the application to create. Its value must be unique for each request.
config (dict | None) – Optional dictionary for arbitrary parameters to the boto API start_job_run call.
wait_for_completion (bool) – If true, waits for the job to start before returning. Defaults to True. If set to False,
waiter_countdown
andwaiter_check_interval_seconds
will only be applied when waiting for the application be to in theSTARTED
state.aws_conn_id (str) – AWS connection to use.
name (str | None) – Name for the EMR Serverless job. If not provided, a default name will be assigned.
waiter_countdown (int) – Total amount of time, in seconds, the operator will wait for the job finish. Defaults to 25 minutes.
waiter_check_interval_seconds (int) – Number of seconds between polling the state of the job. Defaults to 60 seconds.
- class airflow.providers.amazon.aws.operators.emr.EmrServerlessDeleteApplicationOperator(application_id, wait_for_completion=True, aws_conn_id='aws_default', waiter_countdown=25 * 60, waiter_check_interval_seconds=60, **kwargs)[source]¶
Bases:
airflow.models.BaseOperator
Operator to delete EMR Serverless application
See also
For more information on how to use this operator, take a look at the guide: Delete an EMR Serverless Application
- Parameters
application_id (str) – ID of the EMR Serverless application to delete.
wait_for_completion (bool) – If true, wait for the Application to start before returning. Default to True
aws_conn_id (str) – AWS connection to use
waiter_countdown (int) – Total amount of time, in seconds, the operator will wait for the application be deleted. Defaults to 25 minutes.
waiter_check_interval_seconds (int) – Number of seconds between polling the state of the application. Defaults to 60 seconds.