airflow.providers.amazon.aws.hooks.emr¶
Module Contents¶
Classes¶
Interact with Amazon Elastic MapReduce Service. |
|
Interact with EMR Serverless API. |
|
Interact with AWS EMR Virtual Cluster to run, poll jobs and return job status |
- class airflow.providers.amazon.aws.hooks.emr.EmrHook(emr_conn_id=default_conn_name, *args, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHookInteract with Amazon Elastic MapReduce Service.
- Parameters
emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. This attribute is only necessary when using the
create_job_flow()method.
Additional arguments (such as
aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.See also
- get_cluster_id_by_name(emr_cluster_name, cluster_states)[source]¶
Fetch id of EMR cluster with given name and (optional) states. Will return only if single id is found.
- create_job_flow(job_flow_overrides)[source]¶
Create and start running a new cluster (job flow).
This method uses
EmrHook.emr_conn_idto receive the initial Amazon EMR cluster configuration. IfEmrHook.emr_conn_idis empty or the connection does not exist, then an empty initial configuration is used.
- add_job_flow_steps(job_flow_id, steps=None, wait_for_completion=False)[source]¶
Add new steps to a running cluster.
- test_connection()[source]¶
Return failed state for test Amazon Elastic MapReduce Connection (untestable).
We need to overwrite this method because this hook is based on
AwsGenericHook, otherwise it will try to test connection to AWS STS by using the default boto3 credential strategy.
- class airflow.providers.amazon.aws.hooks.emr.EmrServerlessHook(*args, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHookInteract with EMR Serverless API.
Additional arguments (such as
aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.See also
- waiter(get_state_callable, get_state_args, parse_response, desired_state, failure_states, object_type, action, countdown=25 * 60, check_interval_seconds=60)[source]¶
Will run the sensor until it turns True.
- Parameters
get_state_callable (Callable) – A callable to run until it returns True
get_state_args (dict) – Arguments to pass to get_state_callable
parse_response (list) – Dictionary keys to extract state from response of get_state_callable
desired_state (set) – Wait until the getter returns this value
failure_states (set) – A set of states which indicate failure and should throw an exception if any are reached before the desired_state
object_type (str) – Used for the reporting string. What are you waiting for? (application, job, etc)
action (str) – Used for the reporting string. What action are you waiting for? (created, deleted, etc)
countdown (int) – Total amount of time the waiter should wait for the desired state before timing out (in seconds). Defaults to 25 * 60 seconds.
check_interval_seconds (int) – Number of seconds waiter should wait before attempting to retry get_state_callable. Defaults to 60 seconds.
- class airflow.providers.amazon.aws.hooks.emr.EmrContainerHook(*args, virtual_cluster_id=None, **kwargs)[source]¶
Bases:
airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHookInteract with AWS EMR Virtual Cluster to run, poll jobs and return job status Additional arguments (such as
aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.See also
- Parameters
virtual_cluster_id (str | None) – Cluster ID of the EMR on EKS virtual cluster
- create_emr_on_eks_cluster(virtual_cluster_name, eks_cluster_name, eks_namespace, tags=None)[source]¶
- submit_job(name, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, tags=None)[source]¶
Submit a job to the EMR Containers API and return the job ID. A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.start_job_run # noqa: E501
- Parameters
name (str) – The name of the job run.
execution_role_arn (str) – The IAM role ARN associated with the job run.
release_label (str) – The Amazon EMR release version to use for the job run.
job_driver (dict) – Job configuration details, e.g. the Spark job parameters.
configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.
client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.
tags (dict | None) – The tags assigned to job runs.
- Returns
Job ID
- Return type
- get_job_failure_reason(job_id)[source]¶
Fetch the reason for a job failure (e.g. error message). Returns None or reason string.
- check_query_status(job_id)[source]¶
Fetch the status of submitted job run. Returns None or one of valid query states. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.describe_job_run # noqa: E501 :param job_id: Id of submitted job run :return: str