airflow.providers.amazon.aws.hooks.emr

Module Contents

Classes

EmrHook

Interact with AWS EMR. emr_conn_id is only necessary for using the

EmrServerlessHook

Interact with EMR Serverless API.

EmrContainerHook

Interact with AWS EMR Virtual Cluster to run, poll jobs and return job status

class airflow.providers.amazon.aws.hooks.emr.EmrHook(emr_conn_id=default_conn_name, *args, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method.

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

conn_name_attr = emr_conn_id[source]
default_conn_name = emr_default[source]
conn_type = emr[source]
hook_name = Amazon Elastic MapReduce[source]
get_cluster_id_by_name(emr_cluster_name, cluster_states)[source]

Fetch id of EMR cluster with given name and (optional) states. Will return only if single id is found.

Parameters
  • emr_cluster_name (str) – Name of a cluster to find

  • cluster_states (list[str]) – State(s) of cluster to find

Returns

id of the EMR cluster

Return type

str | None

create_job_flow(job_flow_overrides)[source]

Creates a job flow using the config from the EMR connection. Keys of the json extra hash may have the arguments of the boto3 run_job_flow method. Overrides for this config may be passed as the job_flow_overrides.

class airflow.providers.amazon.aws.hooks.emr.EmrServerlessHook(*args, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with EMR Serverless API.

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

JOB_INTERMEDIATE_STATES[source]
JOB_FAILURE_STATES[source]
JOB_SUCCESS_STATES[source]
JOB_TERMINAL_STATES[source]
APPLICATION_INTERMEDIATE_STATES[source]
APPLICATION_FAILURE_STATES[source]
APPLICATION_SUCCESS_STATES[source]
conn()[source]

Get the underlying boto3 EmrServerlessAPIService client (cached)

waiter(get_state_callable, get_state_args, parse_response, desired_state, failure_states, object_type, action, countdown=25 * 60, check_interval_seconds=60)[source]

Will run the sensor until it turns True.

Parameters
  • get_state_callable (Callable) – A callable to run until it returns True

  • get_state_args (dict) – Arguments to pass to get_state_callable

  • parse_response (list) – Dictionary keys to extract state from response of get_state_callable

  • desired_state (set) – Wait until the getter returns this value

  • failure_states (set) – A set of states which indicate failure and should throw an exception if any are reached before the desired_state

  • object_type (str) – Used for the reporting string. What are you waiting for? (application, job, etc)

  • action (str) – Used for the reporting string. What action are you waiting for? (created, deleted, etc)

  • countdown (int) – Total amount of time the waiter should wait for the desired state before timing out (in seconds). Defaults to 25 * 60 seconds.

  • check_interval_seconds (int) – Number of seconds waiter should wait before attempting to retry get_state_callable. Defaults to 60 seconds.

get_state(response, keys)[source]
class airflow.providers.amazon.aws.hooks.emr.EmrContainerHook(*args, virtual_cluster_id=None, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS EMR Virtual Cluster to run, poll jobs and return job status Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

Parameters

virtual_cluster_id (str | None) – Cluster ID of the EMR on EKS virtual cluster

INTERMEDIATE_STATES = ['PENDING', 'SUBMITTED', 'RUNNING'][source]
FAILURE_STATES = ['FAILED', 'CANCELLED', 'CANCEL_PENDING'][source]
SUCCESS_STATES = ['COMPLETED'][source]
TERMINAL_STATES = ['COMPLETED', 'FAILED', 'CANCELLED', 'CANCEL_PENDING'][source]
create_emr_on_eks_cluster(virtual_cluster_name, eks_cluster_name, eks_namespace, tags=None)[source]
submit_job(name, execution_role_arn, release_label, job_driver, configuration_overrides=None, client_request_token=None, tags=None)[source]

Submit a job to the EMR Containers API and return the job ID. A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.start_job_run # noqa: E501

Parameters
  • name (str) – The name of the job run.

  • execution_role_arn (str) – The IAM role ARN associated with the job run.

  • release_label (str) – The Amazon EMR release version to use for the job run.

  • job_driver (dict) – Job configuration details, e.g. the Spark job parameters.

  • configuration_overrides (dict | None) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.

  • client_request_token (str | None) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.

  • tags (dict | None) – The tags assigned to job runs.

Returns

Job ID

Return type

str

get_job_failure_reason(job_id)[source]

Fetch the reason for a job failure (e.g. error message). Returns None or reason string.

Parameters

job_id (str) – Id of submitted job run

Returns

str

Return type

str | None

check_query_status(job_id)[source]

Fetch the status of submitted job run. Returns None or one of valid query states. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.describe_job_run # noqa: E501 :param job_id: Id of submitted job run :return: str

poll_query_status(job_id, max_tries=None, poll_interval=30, max_polling_attempts=None)[source]

Poll the status of submitted job run until query state reaches final state. Returns one of the final states.

Parameters
  • job_id (str) – Id of submitted job run

  • max_tries (int | None) – Deprecated - Use max_polling_attempts instead

  • poll_interval (int) – Time (in seconds) to wait between calls to check query status on EMR

  • max_polling_attempts (int | None) – Number of times to poll for query state before function exits

Returns

str

Return type

str | None

stop_query(job_id)[source]

Cancel the submitted job_run

Parameters

job_id (str) – Id of submitted job_run

Returns

dict

Return type

dict

Was this entry helpful?