Writing Logs

Writing Logs Locally

Users can specify the directory for log files with the base_log_folder option in airflow.cfg. By default, logs are placed in the AIRFLOW_HOME directory.

Log files are named according to the following convention: {dag_id}/{task_id}/{execution_date}/{try_number}.log
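
As an illustration of the naming convention, the sketch below builds the log path for one task try. The helper name log_path, the example base folder, and the ISO 8601 rendering of execution_date are assumptions made for illustration; this is not Airflow's own code.

```python
from datetime import datetime

# Naming convention from the text above
LOG_FILENAME_TEMPLATE = "{dag_id}/{task_id}/{execution_date}/{try_number}.log"


def log_path(base_log_folder, dag_id, task_id, execution_date, try_number):
    """Hypothetical helper: join base_log_folder with the per-try log name."""
    relative = LOG_FILENAME_TEMPLATE.format(
        dag_id=dag_id,
        task_id=task_id,
        # Assumption: the execution date is rendered in ISO 8601 form
        execution_date=execution_date.isoformat(),
        try_number=try_number,
    )
    return f"{base_log_folder.rstrip('/')}/{relative}"


example = log_path(
    "/airflow/logs", "example_bash_operator", "run_this_last",
    datetime(2017, 10, 3), 16,
)
print(example)
```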

In addition, users can supply a remote location to store current logs and backups.

In the Airflow Web UI, local logs take precedence over remote logs. If local logs cannot be found or accessed, the remote logs are displayed. Note that logs are only sent to remote storage once a task is complete (including failure); in other words, remote logs for running tasks are unavailable.
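
The precedence rule can be pictured with a small sketch. The function read_task_log and its read_remote callback are hypothetical names; the code only illustrates the "local first, remote as fallback" order described above and is not how Airflow itself is implemented.

```python
def read_task_log(local_path, read_remote):
    """Return (source, text), preferring the local file over remote storage.

    read_remote is a hypothetical zero-argument callable that fetches the
    same log from remote storage.
    """
    try:
        with open(local_path, encoding="utf-8") as f:
            return "local", f.read()
    except OSError:
        # Local log is missing or unreadable: fall back to the remote copy
        return "remote", read_remote()
```

Calling read_task_log with a path that does not exist returns the result of read_remote, tagged "remote"; once the local file is present, its contents win.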

Before you begin

Remote logging uses an existing Airflow connection to read or write logs. If you don’t have a connection set up properly, this process will fail.

Writing Logs to Amazon S3

Enabling remote logging

To enable this feature, airflow.cfg must be configured as follows:

[core]
# Airflow can store logs remotely in AWS S3. Users must supply a remote
# location URL (starting with 's3://...') and an Airflow connection
# id that provides access to the storage location.
remote_logging = True
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = MyS3Conn
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False

In the above example, Airflow will try to use S3Hook('MyS3Conn').
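
To make the shape of remote_base_log_folder concrete, the following sketch splits the example URL into a bucket name and a key prefix using only the standard library. The helper name split_s3_url is an assumption for illustration and is not part of Airflow.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Hypothetical helper: split an s3:// URL into (bucket, key_prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got {url!r}")
    # netloc holds the bucket; the path (minus its leading slash) is the prefix
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_url("s3://my-bucket/path/to/logs")
print(bucket, prefix)
```

A check like this can catch a mistyped scheme in airflow.cfg before any task tries to upload its log.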

Writing Logs to Azure Blob Storage

Airflow can be configured to read and write task logs in Azure Blob Storage.

Follow the steps below to enable Azure Blob Storage logging:

  1. Airflow’s logging system requires a custom .py file to be located in the PYTHONPATH, so that it’s importable from Airflow. Start by creating a directory to store the config file; $AIRFLOW_HOME/config is recommended.

  2. Create empty files called $AIRFLOW_HOME/config/log_config.py and $AIRFLOW_HOME/config/__init__.py.

  3. Copy the contents of airflow/config_templates/airflow_local_settings.py into the log_config.py file created in Step 2.

  4. Customize the following portions of the template:

    # wasb buckets should start with "wasb" just to help Airflow select correct handler
    REMOTE_BASE_LOG_FOLDER = 'wasb-<whatever you want here>'
    
    # Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG
    LOGGING_CONFIG = ...
    
  5. Make sure an Azure Blob Storage (Wasb) connection hook has been defined in Airflow. The hook should have read and write access to the Azure Blob Storage bucket defined above in REMOTE_BASE_LOG_FOLDER.

  6. Update $AIRFLOW_HOME/airflow.cfg to contain:

    remote_logging = True
    logging_config_class = log_config.LOGGING_CONFIG
    remote_log_conn_id = <name of the Azure Blob Storage connection>
    
  7. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.

  8. Verify that logs are showing up for newly executed tasks in the bucket you’ve defined.
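
Steps 1 and 2 above can be scripted. The sketch below creates the recommended config directory and the two empty files; the function name create_log_config_package is hypothetical, and the caller is assumed to pass in the value of AIRFLOW_HOME.

```python
from pathlib import Path


def create_log_config_package(airflow_home):
    """Hypothetical helper: create <airflow_home>/config with empty module files."""
    config_dir = Path(airflow_home) / "config"
    config_dir.mkdir(parents=True, exist_ok=True)
    for name in ("__init__.py", "log_config.py"):
        # touch() leaves an existing file's contents untouched
        (config_dir / name).touch(exist_ok=True)
    return config_dir


# Example call (assumes AIRFLOW_HOME is set in the environment):
#   import os
#   create_log_config_package(os.environ["AIRFLOW_HOME"])
```

The directory still has to be on the PYTHONPATH for log_config to be importable, and the template contents from Step 3 still need to be copied into log_config.py.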

Writing Logs to Google Cloud Storage

Follow the steps below to enable Google Cloud Storage logging.

To enable this feature, airflow.cfg must be configured as in this example:

[core]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.
remote_logging = True
remote_base_log_folder = gs://my-bucket/path/to/logs
remote_log_conn_id = MyGCSConn

  1. Install the gcp package first, like so: pip install 'apache-airflow[gcp]'.

  2. Make sure a Google Cloud Platform connection hook has been defined in Airflow. The hook should have read and write access to the Google Cloud Storage bucket defined above in remote_base_log_folder.

  3. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.

  4. Verify that logs are showing up for newly executed tasks in the bucket you’ve defined.

  5. Verify that the Google Cloud Storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:

*** Reading remote log from gs://<bucket where logs should be persisted>/example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log.
[2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
[2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
[2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
[2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py

Note that the path to the remote log file is listed on the first line.

Writing Logs to Elasticsearch

Airflow can be configured to read task logs from Elasticsearch and optionally write logs to stdout in standard or JSON format. These logs can later be collected and forwarded to the Elasticsearch cluster using tools such as Fluentd or Logstash.

You can choose to have all task logs from workers output to the highest parent level process, instead of the standard file locations. This allows for some additional flexibility in container environments like Kubernetes, where container stdout is already being logged to the host nodes. From there, a log shipping tool can forward them to Elasticsearch. To use this feature, set the write_stdout option in airflow.cfg.

You can also have the logs output in JSON format by setting the json_format option. Airflow uses the standard Python logging module, and JSON fields are extracted directly from the LogRecord object. To choose which fields are collected, set the json_fields option in airflow.cfg to a comma-delimited string of LogRecord attribute names. The available attributes are listed in the Python logging module documentation under LogRecord attributes.

First, to use the handler, airflow.cfg must be configured as follows:

[core]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.
remote_logging = True

[elasticsearch]
log_id_template = {{dag_id}}-{{task_id}}-{{execution_date}}-{{try_number}}
end_of_log_mark = end_of_log
write_stdout =
json_fields =

To output task logs to stdout in JSON format, the following config could be used:

[core]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.
remote_logging = True

[elasticsearch]
log_id_template = {{dag_id}}-{{task_id}}-{{execution_date}}-{{try_number}}
end_of_log_mark = end_of_log
write_stdout = True
json_format = True
json_fields = asctime, filename, lineno, levelname, message
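
To show what extracting JSON fields from a LogRecord can look like, here is a minimal sketch built only on the standard logging module. The class name JsonFieldsFormatter and the sample record are hypothetical; this is not Airflow's own formatter, only an illustration of how a json_fields string such as the one above maps onto LogRecord attributes.

```python
import json
import logging


class JsonFieldsFormatter(logging.Formatter):
    """Hypothetical formatter: emits the chosen LogRecord attributes as JSON."""

    def __init__(self, json_fields):
        super().__init__()
        self.json_fields = list(json_fields)

    def usesTime(self):
        # Ask the base class to populate record.asctime when it is requested
        return "asctime" in self.json_fields

    def format(self, record):
        # The base class sets record.message (and record.asctime if usesTime())
        super().format(record)
        return json.dumps(
            {name: getattr(record, name, None) for name in self.json_fields}
        )


# Split the comma-delimited option value into individual field names
json_fields = [
    f.strip() for f in "asctime, filename, lineno, levelname, message".split(",")
]
formatter = JsonFieldsFormatter(json_fields)

# Sample record with made-up values, built by hand for the demonstration
record = logging.LogRecord(
    name="airflow.task", level=logging.INFO,
    pathname="/airflow/dags/example.py", lineno=42,
    msg="Running on host %s", args=("worker-1",), exc_info=None,
)
line = formatter.format(record)
print(line)
```

Each output line is a single JSON object, which is the form that log shippers typically parse without extra pattern rules.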

Writing Logs to Elasticsearch over TLS

To add custom configurations to Elasticsearch (e.g. turning on ssl_verify, adding a custom self-signed cert, etc.), use the elasticsearch_configs setting in your airflow.cfg:

[core]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.
remote_logging = True

[elasticsearch_configs]
use_ssl=True
verify_certs=True
ca_certs=/path/to/CA_certs
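
As a rough illustration, the sketch below reads the [elasticsearch_configs] section with the standard configparser module and turns it into typed keyword arguments. That these options end up as keyword arguments for an Elasticsearch client is an assumption here, and the variable names are hypothetical.

```python
import configparser

# The section from the example above, inlined so the sketch is self-contained
AIRFLOW_CFG_SNIPPET = """
[elasticsearch_configs]
use_ssl=True
verify_certs=True
ca_certs=/path/to/CA_certs
"""

parser = configparser.ConfigParser()
parser.read_string(AIRFLOW_CFG_SNIPPET)
section = parser["elasticsearch_configs"]

# Convert the string values from the file into typed options
es_kwargs = {
    "use_ssl": section.getboolean("use_ssl"),
    "verify_certs": section.getboolean("verify_certs"),
    "ca_certs": section.get("ca_certs"),
}
print(es_kwargs)
```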