airflow.providers.amazon.aws.hooks.emr
¶
Module Contents¶
Classes¶
Interact with Amazon Elastic MapReduce Service (EMR). |
|
Interact with Amazon EMR Serverless. |
|
Interact with Amazon EMR Containers (Amazon EMR on EKS). |
- class airflow.providers.amazon.aws.hooks.emr.EmrHook(emr_conn_id=default_conn_name, *args, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon Elastic MapReduce Service (EMR). Provide thick wrapper around
boto3.client("emr")
.- Parameters
emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. This attribute is only necessary when using the
airflow.providers.amazon.aws.hooks.emr.EmrHook.create_job_flow()
.
Additional arguments (such as
aws_conn_id
) may be specified and are passed down to the underlying AwsBaseHook.See also
- get_cluster_id_by_name(emr_cluster_name, cluster_states)[source]¶
Fetch id of EMR cluster with given name and (optional) states. Will return only if single id is found.
See also
- create_job_flow(job_flow_overrides)[source]¶
Create and start running a new cluster (job flow).
See also
This method uses
EmrHook.emr_conn_id
to receive the initial Amazon EMR cluster configuration. IfEmrHook.emr_conn_id
is empty or the connection does not exist, then an empty initial configuration is used.- Parameters
job_flow_overrides (dict[str, Any]) – Is used to overwrite the parameters in the initial Amazon EMR configuration cluster. The resulting configuration will be used in the
EMR.Client.run_job_flow()
.
- add_job_flow_steps(job_flow_id, steps=None, wait_for_completion=False, waiter_delay=None, waiter_max_attempts=None, execution_role_arn=None)[source]¶
Add new steps to a running cluster.
See also
- Parameters
job_flow_id (str) – The id of the job flow to which the steps are being added
steps (list[dict] | str | None) – A list of the steps to be executed by the job flow
wait_for_completion (bool) – If True, wait for the steps to be completed. Default is False
waiter_delay (int | None) – The amount of time in seconds to wait between attempts. Default is 5
waiter_max_attempts (int | None) – The maximum number of attempts to be made. Default is 100
execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.
- test_connection()[source]¶
Return failed state for test Amazon Elastic MapReduce Connection (untestable).
We need to overwrite this method because this hook is based on
AwsGenericHook
, otherwise it will try to test connection to AWS STS by using the default boto3 credential strategy.
- class airflow.providers.amazon.aws.hooks.emr.EmrServerlessHook(*args, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon EMR Serverless. Provide thin wrapper around
boto3.client("emr-serverless")
.Additional arguments (such as
aws_conn_id
) may be specified and are passed down to the underlying AwsBaseHook.- waiter(get_state_callable, get_state_args, parse_response, desired_state, failure_states, object_type, action, countdown=25 * 60, check_interval_seconds=60)[source]¶
Will run the sensor until it turns True.
- Parameters
get_state_callable (Callable) – A callable to run until it returns True
get_state_args (dict) – Arguments to pass to get_state_callable
parse_response (list) – Dictionary keys to extract state from response of get_state_callable
desired_state (set) – Wait until the getter returns this value
failure_states (set) – A set of states which indicate failure and should throw an exception if any are reached before the desired_state
object_type (str) – Used for the reporting string. What are you waiting for? (application, job, etc)
action (str) – Used for the reporting string. What action are you waiting for? (created, deleted, etc)
countdown (int) – Total amount of time the waiter should wait for the desired state before timing out (in seconds). Defaults to 25 * 60 seconds.
check_interval_seconds (int) – Number of seconds waiter should wait before attempting to retry get_state_callable. Defaults to 60 seconds.
- class airflow.providers.amazon.aws.hooks.emr.EmrContainerHook(*args, virtual_cluster_id=None, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon EMR Containers (Amazon EMR on EKS). Provide thick wrapper around
boto3.client("emr-containers")
.- Parameters
virtual_cluster_id (str | None) – Cluster ID of the EMR on EKS virtual cluster
Additional arguments (such as
aws_conn_id
) may be specified and are passed down to the underlying AwsBaseHook.- create_emr_on_eks_cluster(virtual_cluster_name, eks_cluster_name, eks_namespace, tags=None)[source]¶
- submit_job(name, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, tags=None)[source]¶
Submit a job to the EMR Containers API and return the job ID. A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS.
See also
- Parameters
name (str) – The name of the job run.
execution_role_arn (str) – The IAM role ARN associated with the job run.
release_label (str) – The Amazon EMR release version to use for the job run.
job_driver (dict) – Job configuration details, e.g. the Spark job parameters.
configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.
client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.
tags (dict | None) – The tags assigned to job runs.
- Returns
The ID of the job run request.
- Return type
- get_job_failure_reason(job_id)[source]¶
Fetch the reason for a job failure (e.g. error message). Returns None or reason string.
- Parameters
job_id (str) – The ID of the job run request.
- check_query_status(job_id)[source]¶
Fetch the status of submitted job run. Returns None or one of valid query states.
- Parameters
job_id (str) – The ID of the job run request.
- poll_query_status(job_id, max_tries=None, poll_interval=30, max_polling_attempts=None)[source]¶
Poll the status of submitted job run until query state reaches final state. Returns one of the final states.
- Parameters
job_id (str) – The ID of the job run request.
max_tries (int | None) – Deprecated - Use max_polling_attempts instead
poll_interval (int) – Time (in seconds) to wait between calls to check query status on EMR
max_polling_attempts (int | None) – Number of times to poll for query state before function exits