airflow.providers.amazon.aws.hooks.emr_containers

Module Contents

class airflow.providers.amazon.aws.hooks.emr_containers.EMRContainerHook(*args, virtual_cluster_id: str = None, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS EMR Virtual Cluster to run, poll jobs and return job status Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

See also

AwsBaseHook

Parameters

virtual_cluster_id (str) – Cluster ID of the EMR on EKS virtual cluster

INTERMEDIATE_STATES = ['PENDING', 'SUBMITTED', 'RUNNING'][source]
FAILURE_STATES = ['FAILED', 'CANCELLED', 'CANCEL_PENDING'][source]
SUCCESS_STATES = ['COMPLETED'][source]
submit_job(self, name: str, execution_role_arn: str, release_label: str, job_driver: dict, configuration_overrides: Optional[dict] = None, client_request_token: Optional[str] = None)[source]

Submit a job to the EMR Containers API and and return the job ID. A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.start_job_run # noqa: E501

Parameters
  • name (str) – The name of the job run.

  • execution_role_arn (str) – The IAM role ARN associated with the job run.

  • release_label (str) – The Amazon EMR release version to use for the job run.

  • job_driver (dict) – Job configuration details, e.g. the Spark job parameters.

  • configuration_overrides (dict) – The configuration overrides for the job run, specifically either application configuration or monitoring configuration.

  • client_request_token (str) – The client idempotency token of the job run request. Use this if you want to specify a unique ID to prevent two jobs from getting started.

Returns

Job ID

get_job_failure_reason(self, job_id: str)[source]

Fetch the reason for a job failure (e.g. error message). Returns None or reason string.

Parameters

job_id (str) – Id of submitted job run

Returns

str

check_query_status(self, job_id: str)[source]

Fetch the status of submitted job run. Returns None or one of valid query states. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr-containers.html#EMRContainers.Client.describe_job_run # noqa: E501 :param job_id: Id of submitted job run :type job_id: str :return: str

poll_query_status(self, job_id: str, max_tries: Optional[int] = None, poll_interval: int = 30)[source]

Poll the status of submitted job run until query state reaches final state. Returns one of the final states.

Parameters
  • job_id (str) – Id of submitted job run

  • max_tries (int) – Number of times to poll for query state before function exits

  • poll_interval (int) – Time (in seconds) to wait between calls to check query status on EMR

Returns

str

stop_query(self, job_id: str)[source]

Cancel the submitted job_run

Parameters

job_id (str) – Id of submitted job_run

Returns

dict

Was this entry helpful?