airflow.providers.amazon.aws.hooks.emr
Classes
EmrHook | Interact with Amazon Elastic MapReduce Service (EMR).
EmrServerlessHook | Interact with Amazon EMR Serverless.
EmrContainerHook | Interact with Amazon EMR Containers (Amazon EMR on EKS).
Functions
|
Module Contents
- class airflow.providers.amazon.aws.hooks.emr.EmrHook(emr_conn_id=default_conn_name, *args, **kwargs)
Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon Elastic MapReduce Service (EMR).
Provide thick wrapper around boto3.client("emr").
- Parameters:
emr_conn_id (str | None) – Amazon Elastic MapReduce Connection. This attribute is only necessary when using airflow.providers.amazon.aws.hooks.emr.EmrHook.create_job_flow().
Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.
See also
- get_cluster_id_by_name(emr_cluster_name, cluster_states)
Fetch the ID of the EMR cluster with the given name and (optional) states; returns an ID only if exactly one match is found.
See also
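The "single id" rule can be illustrated with a small stand-alone function. This is a minimal sketch, not the hook's implementation: `pick_single_cluster_id` and `find_cluster` are hypothetical names, the input is assumed to look like the "Clusters" list of a boto3 EMR `list_clusters` response, and the handling of zero or several matches (return None, raise an error) is an assumption that this page does not confirm.

```python
from typing import Optional


def pick_single_cluster_id(clusters: list, name: str) -> Optional[str]:
    """Return the Id of the only cluster called `name`, or None if there is none.

    `clusters` is assumed to be a list of dicts with "Id" and "Name" keys.
    """
    matching = [c["Id"] for c in clusters if c["Name"] == name]
    if len(matching) > 1:
        # Assumption: an ambiguous name is treated as an error in this sketch.
        raise ValueError(f"More than one cluster named {name!r}: {matching}")
    return matching[0] if matching else None


def find_cluster(name: str) -> Optional[str]:
    """Hypothetical caller; needs apache-airflow-providers-amazon and AWS credentials."""
    # Imported lazily so the sketch above can be run without Airflow installed.
    from airflow.providers.amazon.aws.hooks.emr import EmrHook

    hook = EmrHook(aws_conn_id="aws_default")
    return hook.get_cluster_id_by_name(name, ["RUNNING", "WAITING"])
```

Restricting `cluster_states` to active states such as RUNNING and WAITING is one way to keep terminated clusters that share a name from making the lookup ambiguous.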
- create_job_flow(job_flow_overrides)
Create and start running a new cluster (job flow).
See also
This method uses EmrHook.emr_conn_id to receive the initial Amazon EMR cluster configuration. If EmrHook.emr_conn_id is empty or the connection does not exist, then an empty initial configuration is used.
- Parameters:
job_flow_overrides (dict[str, Any]) – Overrides the parameters in the initial Amazon EMR cluster configuration. The resulting configuration is passed to
EMR.Client.run_job_flow().
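The relationship between the initial configuration and `job_flow_overrides` can be sketched as a dictionary overlay. This is an illustration under stated assumptions: it performs a shallow, top-level update, and whether the hook merges nested keys is not stated on this page. `merge_job_flow_config` and `start_cluster` are hypothetical helper names, and the connection IDs are placeholders.

```python
from typing import Any, Dict


def merge_job_flow_config(initial: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay `overrides` on `initial` (top-level keys only) without mutating either."""
    merged = dict(initial)
    merged.update(overrides)
    return merged


def start_cluster(overrides: Dict[str, Any]) -> Any:
    """Hypothetical caller; needs apache-airflow-providers-amazon and AWS credentials."""
    # Imported lazily so the merge sketch can be run without Airflow installed.
    from airflow.providers.amazon.aws.hooks.emr import EmrHook

    hook = EmrHook(aws_conn_id="aws_default", emr_conn_id="emr_default")
    # The return value is passed through unchanged; its shape is not described on this page.
    return hook.create_job_flow(overrides)
```

With a shallow overlay, overriding a nested key such as "Instances" replaces the whole nested dictionary, so the override would need to carry every nested setting it wants to keep.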
- add_job_flow_steps(job_flow_id, steps=None, wait_for_completion=False, waiter_delay=None, waiter_max_attempts=None, execution_role_arn=None)
Add new steps to a running cluster.
See also
- Parameters:
job_flow_id (str) – The id of the job flow to which the steps are being added
steps (list[dict] | str | None) – A list of the steps to be executed by the job flow
wait_for_completion (bool) – If True, wait for the steps to be completed. Default is False
waiter_delay (int | None) – The amount of time in seconds to wait between attempts. Default is 5
waiter_max_attempts (int | None) – The maximum number of attempts to be made. Default is 100
execution_role_arn (str | None) – The ARN of the runtime role for a step on the cluster.
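A hedged example of what the `steps` argument might contain: each entry below follows the step structure used by the boto3 EMR `add_job_flow_steps` call (Name, ActionOnFailure, HadoopJarStep). The helper names `spark_step` and `submit_steps`, the S3 URI, and the waiter values are illustrative assumptions, not values taken from this page.

```python
def spark_step(name: str, script_uri: str) -> dict:
    """Build one step dict that runs a PySpark script through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_steps(job_flow_id: str, steps: list):
    """Hypothetical caller; needs apache-airflow-providers-amazon and AWS credentials."""
    # Imported lazily so the step builder can be run without Airflow installed.
    from airflow.providers.amazon.aws.hooks.emr import EmrHook

    hook = EmrHook(aws_conn_id="aws_default")
    # Poll every 30 seconds, at most 60 times (illustrative values).
    return hook.add_job_flow_steps(
        job_flow_id=job_flow_id,
        steps=steps,
        wait_for_completion=True,
        waiter_delay=30,
        waiter_max_attempts=60,
    )
```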
- test_connection()
Return a failed state when testing the Amazon Elastic MapReduce Connection, which is untestable.
This method must be overridden because this hook is based on AwsGenericHook; otherwise it would try to test the connection to AWS STS by using the default boto3 credential strategy.
- class airflow.providers.amazon.aws.hooks.emr.EmrServerlessHook(*args, **kwargs)
Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon EMR Serverless.
Provide thin wrapper around boto3.client("emr-serverless").
Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.
- cancel_running_jobs(application_id, waiter_config=None, wait_for_completion=True)
Cancel jobs in an intermediate state, and return the number of cancelled jobs.
If wait_for_completion is True, then the method will wait until all jobs are cancelled before returning.
Note: if new jobs are triggered while this operation is ongoing, it’s going to time out and return an error.
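Since the call can time out while waiting, it may help to estimate the longest wait a given `waiter_config` allows. This is a rough sketch that assumes `waiter_config` follows the boto3 waiter convention of "Delay" (seconds between polls) and "MaxAttempts" keys, which this page does not confirm; `max_wait_seconds` and `cancel_all` are hypothetical helper names.

```python
def max_wait_seconds(waiter_config: dict) -> int:
    """Upper bound on polling time, assuming boto3-style "Delay" and "MaxAttempts" keys."""
    return waiter_config["Delay"] * waiter_config["MaxAttempts"]


def cancel_all(application_id: str, waiter_config: dict) -> int:
    """Hypothetical caller; needs apache-airflow-providers-amazon and AWS credentials."""
    # Imported lazily so the estimate above can be run without Airflow installed.
    from airflow.providers.amazon.aws.hooks.emr import EmrServerlessHook

    hook = EmrServerlessHook(aws_conn_id="aws_default")
    # Returns the number of cancelled jobs, as described above.
    return hook.cancel_running_jobs(
        application_id,
        waiter_config=waiter_config,
        wait_for_completion=True,
    )
```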
- class airflow.providers.amazon.aws.hooks.emr.EmrContainerHook(*args, virtual_cluster_id=None, **kwargs)
Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook
Interact with Amazon EMR Containers (Amazon EMR on EKS).
Provide thick wrapper around boto3.client("emr-containers").
- Parameters:
virtual_cluster_id (str | None) – Cluster ID of the EMR on EKS virtual cluster
Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.
- create_emr_on_eks_cluster(virtual_cluster_name, eks_cluster_name, eks_namespace, tags=None)
- submit_job(name, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, tags=None, retry_max_attempts=None)
Submit a job to the EMR Containers API and return the job ID.
A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS.
See also
- Parameters:
name (str) – The name of the job run.
execution_role_arn (str) – The IAM role ARN associated with the job run.
release_label (str) – The Amazon EMR release version to use for the job run.
job_driver (dict) – Job configuration details, e.g. the Spark job parameters.
configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.
client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.
tags (dict | None) – The tags assigned to job runs.
retry_max_attempts (int | None) – The maximum number of attempts on the job’s driver.
- Returns:
The ID of the job run request.
- Return type:
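As an illustration of the `job_driver` parameter, the sketch below builds a Spark driver dictionary in the shape used by the EMR on EKS `StartJobRun` API ("sparkSubmitJobDriver" with "entryPoint", "entryPointArguments" and "sparkSubmitParameters"). The helper names, the release label, and the Spark settings are placeholder assumptions rather than values from this page.

```python
from typing import List


def spark_job_driver(entry_point: str, args: List[str], executor_instances: int = 2) -> dict:
    """Build a job_driver dict for a PySpark script stored at `entry_point`."""
    return {
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "entryPointArguments": list(args),
            "sparkSubmitParameters": f"--conf spark.executor.instances={executor_instances}",
        }
    }


def submit(virtual_cluster_id: str, role_arn: str, driver: dict):
    """Hypothetical caller; needs apache-airflow-providers-amazon and AWS credentials."""
    # Imported lazily so the driver builder can be run without Airflow installed.
    from airflow.providers.amazon.aws.hooks.emr import EmrContainerHook

    hook = EmrContainerHook(aws_conn_id="aws_default", virtual_cluster_id=virtual_cluster_id)
    # Returns the ID of the job run request, as described above.
    return hook.submit_job(
        name="example-job",
        execution_role_arn=role_arn,
        release_label="emr-6.15.0-latest",  # placeholder release label
        job_driver=driver,
    )
```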
- get_job_failure_reason(job_id)
Fetch the reason for a job failure (e.g. error message). Returns None or reason string.
- Parameters:
job_id (str) – The ID of the job run request.
- check_query_status(job_id)
Fetch the status of a submitted job run. Returns None or one of the valid query states.
- Parameters:
job_id (str) – The ID of the job run request.
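The two status methods are typically combined in a polling loop. The following is a minimal sketch, not the provider's own waiting logic: `wait_for_job` and `run_and_report` are hypothetical helpers, and the set of terminal states (COMPLETED, FAILED, CANCELLED) is an assumption based on EMR on EKS job run states, since this page does not list the valid states.

```python
import time
from typing import Callable, Optional

# Assumed terminal states; the page above does not enumerate them.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(
    get_status: Callable[[], Optional[str]],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `get_status` until it returns a terminal state or `max_polls` is reached."""
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    # Polls exhausted: return the last status seen (may be None or non-terminal).
    return status


def run_and_report(hook, job_id: str) -> str:
    """Poll `job_id` on an EmrContainerHook-like object and describe the outcome."""
    final = wait_for_job(lambda: hook.check_query_status(job_id))
    if final == "FAILED":
        return f"FAILED: {hook.get_job_failure_reason(job_id)}"
    return str(final)
```

Passing the status lookup in as a callable keeps the loop independent of Airflow, so it can be exercised with a stub before being pointed at a real hook.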