AWS Glue DataBrew

AWS Glue DataBrew is a visual data preparation tool that makes it easier for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning (ML). You can choose from over 250 prebuilt transformations to automate data preparation tasks, all without the need to write any code. You can automate filtering anomalies, converting data to standard formats and correcting invalid values, and other tasks. After your data is ready, you can immediately use it for analytics and ML projects.

Prerequisite Tasks

To use these operators, you must do a few things:

Generic Parameters

aws_conn_id

Reference to Amazon Web Services Connection ID. If this parameter is set to None then the default boto3 behaviour is used without a connection lookup. Otherwise use the credentials stored in the Connection. Default: aws_default

region_name

AWS Region Name. If this parameter is set to None or omitted then region_name from AWS Connection Extra Parameter will be used. Otherwise use the specified value instead of the connection value. Default: None

verify

Whether or not to verify SSL certificates.

  • False - Do not validate SSL certificates.

  • path/to/cert/bundle.pem - A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

If this parameter is set to None or is omitted then verify from AWS Connection Extra Parameter will be used. Otherwise use the specified value instead of the connection value. Default: None

botocore_config

The provided dictionary is used to construct a botocore.config.Config. This configuration can be used to configure Avoid Throttling exceptions, timeouts, etc.

Example, for more detail about parameters please have a look botocore.config.Config
{
    "signature_version": "unsigned",
    "s3": {
        "us_east_1_regional_endpoint": True,
    },
    "retries": {
      "mode": "standard",
      "max_attempts": 10,
    },
    "connect_timeout": 300,
    "read_timeout": 300,
    "tcp_keepalive": True,
}

If this parameter is set to None or omitted then config_kwargs from AWS Connection Extra Parameter will be used. Otherwise use the specified value instead of the connection value. Default: None

Note

Specifying an empty dictionary, {}, will overwrite the connection configuration for botocore.config.Config

Operators

Start an AWS Glue DataBrew job

To submit a new AWS Glue DataBrew job you can use GlueDataBrewStartJobOperator.

tests/system/amazon/aws/example_glue_databrew.py

start_job = GlueDataBrewStartJobOperator(task_id="startjob", job_name=job_name, waiter_delay=15)

Was this entry helpful?