Amazon Comprehend

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.

Prerequisite Tasks

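To use these operators, you must do a few things:

  • Create the necessary resources using the AWS Console or AWS CLI.

  • Install the API libraries via pip: pip install 'apache-airflow[amazon]'

  • Set up an AWS Connection.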

Generic Parameters

aws_conn_id

Reference to Amazon Web Services Connection ID. If this parameter is set to None then the default boto3 behaviour is used without a connection lookup. Otherwise use the credentials stored in the Connection. Default: aws_default

region_name

AWS Region Name. If this parameter is set to None or omitted then region_name from AWS Connection Extra Parameter will be used. Otherwise use the specified value instead of the connection value. Default: None

verify

Whether or not to verify SSL certificates.

  • False - Do not validate SSL certificates.

  • path/to/cert/bundle.pem - A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

If this parameter is set to None or is omitted then verify from AWS Connection Extra Parameter will be used. Otherwise use the specified value instead of the connection value. Default: None

botocore_config

The provided dictionary is used to construct a botocore.config.Config. This configuration can be used to avoid throttling exceptions, set timeouts, and so on.

Example (for more detail about the parameters, see botocore.config.Config):
{
    "signature_version": "unsigned",
    "s3": {
        "us_east_1_regional_endpoint": True,
    },
    "retries": {
      "mode": "standard",
      "max_attempts": 10,
    },
    "connect_timeout": 300,
    "read_timeout": 300,
    "tcp_keepalive": True,
}

If this parameter is set to None or omitted then config_kwargs from AWS Connection Extra Parameter will be used. Otherwise use the specified value instead of the connection value. Default: None

Note

Specifying an empty dictionary, {}, overwrites the connection configuration for botocore.config.Config.
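The precedence rule described above (None falls back to the connection, any other value wins, including an empty dictionary) can be illustrated with a small standalone sketch. The function name resolve_botocore_config is hypothetical and is not part of the provider; it only mirrors the documented behaviour.

```python
from typing import Any, Optional


def resolve_botocore_config(
    operator_value: Optional[dict[str, Any]],
    connection_value: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: return the config that takes effect under the documented rule."""
    # Only None (or omission) falls back to the connection value. An empty
    # dict is still an explicit value, so a plain truthiness check would be wrong.
    if operator_value is None:
        return connection_value
    return operator_value


# Stand-in for config_kwargs stored in the AWS Connection Extra Parameter.
connection_config = {"retries": {"mode": "standard", "max_attempts": 10}}

print(resolve_botocore_config(None, connection_config))  # connection value is used
print(resolve_botocore_config({}, connection_config))  # empty dict overwrites it
print(resolve_botocore_config({"read_timeout": 60}, connection_config))
```

The same "None means use the connection value" pattern applies to region_name and verify as described in their sections.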

Operators

Create an Amazon Comprehend Start PII Entities Detection Job

To create an Amazon Comprehend Start PII Entities Detection Job, you can use ComprehendStartPiiEntitiesDetectionJobOperator.

tests/system/amazon/aws/example_comprehend.py[source]

start_pii_entities_detection_job = ComprehendStartPiiEntitiesDetectionJobOperator(
    task_id="start_pii_entities_detection_job",
    input_data_config=input_data_configurations,
    output_data_config=output_data_configurations,
    mode="ONLY_REDACTION",
    data_access_role_arn=test_context[ROLE_ARN_KEY],
    language_code="en",
    start_pii_entities_kwargs=pii_entities_kwargs,
)
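The excerpt above refers to three dictionaries that are defined elsewhere in the example file. As a rough sketch, they could look like the following. The bucket name and object keys are hypothetical placeholders; the dictionary keys follow the request shape of the Comprehend StartPiiEntitiesDetectionJob API.

```python
# Hypothetical bucket; it must be readable and writable by the data access role.
bucket_name = "my-comprehend-bucket"

# Where the input documents live and how they are split into documents.
input_data_configurations = {
    "S3Uri": f"s3://{bucket_name}/input/sample_data.txt",
    "InputFormat": "ONE_DOC_PER_LINE",
}

# Where Comprehend writes the redacted output.
output_data_configurations = {"S3Uri": f"s3://{bucket_name}/redacted_output/"}

# Extra arguments for the job. With mode="ONLY_REDACTION" the API expects
# a RedactionConfig describing which entity types to mask and how.
pii_entities_kwargs = {
    "RedactionConfig": {
        "PiiEntityTypes": ["NAME", "ADDRESS"],
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
    }
}
```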

Create an Amazon Comprehend Document Classifier

To create an Amazon Comprehend Document Classifier, you can use ComprehendCreateDocumentClassifierOperator.

tests/system/amazon/aws/example_comprehend_document_classifier.py[source]

create_document_classifier = ComprehendCreateDocumentClassifierOperator(
    task_id="create_document_classifier",
    document_classifier_name=classifier_name,
    input_data_config=input_data_configurations,
    output_data_config=output_data_configurations,
    mode="MULTI_CLASS",
    data_access_role_arn=test_context[ROLE_ARN_KEY],
    language_code="en",
    document_classifier_kwargs=document_classifier_kwargs,
)
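The classifier excerpt also depends on dictionaries defined elsewhere in the example file. A minimal sketch for plain-text CSV training data might look like this. The bucket name, object key, and version name are hypothetical; the dictionary keys follow the Comprehend CreateDocumentClassifier request shape.

```python
# Hypothetical bucket holding the labeled training data.
bucket_name = "my-comprehend-bucket"

# Training data location and format. COMPREHEND_CSV expects one
# "label,document text" row per training document.
input_data_configurations = {
    "DataFormat": "COMPREHEND_CSV",
    "S3Uri": f"s3://{bucket_name}/training/labeled_documents.csv",
}

# Where Comprehend writes training output such as the confusion matrix.
output_data_configurations = {"S3Uri": f"s3://{bucket_name}/classifier_output/"}

# Optional extra arguments, for example a version name for the classifier.
document_classifier_kwargs = {"VersionName": "v1"}
```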

Sensors

Wait for an Amazon Comprehend Start PII Entities Detection Job

To wait for an Amazon Comprehend Start PII Entities Detection Job to reach a terminal state, you can use ComprehendStartPiiEntitiesDetectionJobCompletedSensor.

tests/system/amazon/aws/example_comprehend.py[source]

await_start_pii_entities_detection_job = ComprehendStartPiiEntitiesDetectionJobCompletedSensor(
    task_id="await_start_pii_entities_detection_job", job_id=start_pii_entities_detection_job.output
)
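To make "terminal state" concrete, the sketch below classifies the JobStatus values that Comprehend reports for a detection job. It is an illustration only, not the sensor's implementation: the function name and the grouping of states into pending, success, and failure are assumptions.

```python
# JobStatus values reported by Comprehend, grouped by an assumed outcome.
PENDING_STATES = {"SUBMITTED", "IN_PROGRESS"}
SUCCESS_STATES = {"COMPLETED"}
FAILURE_STATES = {"FAILED", "STOP_REQUESTED", "STOPPED"}


def job_is_done(job_status: str) -> bool:
    """Hypothetical check: True on success, False while pending, raise otherwise."""
    if job_status in SUCCESS_STATES:
        return True
    if job_status in FAILURE_STATES:
        raise RuntimeError(f"PII entities detection job ended in state {job_status}")
    if job_status in PENDING_STATES:
        return False  # keep waiting
    raise ValueError(f"Unknown job status: {job_status}")


print(job_is_done("IN_PROGRESS"))
print(job_is_done("COMPLETED"))
```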

Wait for an Amazon Comprehend Document Classifier

To wait for an Amazon Comprehend Document Classifier to reach a terminal state, you can use ComprehendCreateDocumentClassifierCompletedSensor.

tests/system/amazon/aws/example_comprehend_document_classifier.py[source]

await_create_document_classifier = ComprehendCreateDocumentClassifierCompletedSensor(
    task_id="await_create_document_classifier", document_classifier_arn=create_document_classifier.output
)
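A document classifier reports different status values from a detection job, including TRAINED_WITH_WARNING. The sketch below shows one way to reason about them. The function and its treat_warning_as_failure flag are hypothetical, and whether a warning counts as success is an assumption made for illustration.

```python
def classifier_is_ready(status: str, treat_warning_as_failure: bool = False) -> bool:
    """Hypothetical check on a Comprehend document classifier Status value."""
    if status == "TRAINED":
        return True
    if status == "TRAINED_WITH_WARNING":
        # Training finished, but some input was skipped or flagged.
        if treat_warning_as_failure:
            raise RuntimeError("Classifier trained with warnings")
        return True
    if status in {"IN_ERROR", "STOP_REQUESTED", "STOPPED", "DELETING"}:
        raise RuntimeError(f"Classifier training ended in state {status}")
    if status in {"SUBMITTED", "TRAINING"}:
        return False  # still training, keep waiting
    raise ValueError(f"Unknown classifier status: {status}")


print(classifier_is_ready("TRAINING"))
print(classifier_is_ready("TRAINED_WITH_WARNING"))
```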

Reference
