Logging for Tasks¶
Airflow writes logs for tasks in a way that allows you to see the logs for each task separately in the Airflow UI. Core Airflow provides an interface FileTaskHandler, which writes task logs to file, and includes a mechanism to serve them from workers while tasks are running. The Apache Airflow Community also releases providers for many services (Provider packages) and some of them provide handlers that extend the logging capability of Apache Airflow. You can see all of these providers in Writing logs.
When using S3, GCS, WASB or OSS remote logging service, you can delete the local log files after they are uploaded to the remote location, by setting the config:
[logging]
remote_logging = True
remote_base_log_folder = schema://path/to/remote/log
delete_local_logs = True
Configuring logging¶
For the default handler, FileTaskHandler, you can specify the directory to place log files in airflow.cfg
using
base_log_folder
. By default, logs are placed in the AIRFLOW_HOME
directory.
Note
For more information on setting the configuration, see Setting Configuration Options
The default pattern is followed while naming log files for tasks:
For normal tasks:
dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log
.For dynamically mapped tasks:
dag_id={dag_id}/run_id={run_id}/task_id={task_id}/map_index={map_index}/attempt={try_number}.log
.
These patterns can be adjusted by log_filename_template.
In addition, you can supply a remote location to store current logs and backups.
Writing to task logs from your code¶
Airflow uses standard the Python logging framework to write logs, and for the duration of a task, the root logger is configured to write to the task’s log.
Most operators will write logs to the task log automatically. This is because they
have a log
logger that you can use to write to the task log.
This logger is created and configured by LoggingMixin
that all
operators derive from. But also due to the root logger handling, any standard logger (using default settings) that
propagates logging to the root will also write to the task log.
So if you want to log to the task log from custom code of yours you can do any of the following:
Log with the
self.log
logger from BaseOperatorUse standard
print
statements to print tostdout
(not recommended, but in some cases it can be useful)Use the standard logger approach of creating a logger using the Python module name and using it to write to the task log
This is the usual way loggers are used directly in Python code:
import logging
logger = logging.getLogger(__name__)
logger.info("This is a log message")
Interleaving of logs¶
Airflow’s remote task logging handlers can broadly be separated into two categories: streaming handlers (such as ElasticSearch, AWS Cloudwatch, and GCP operations logging, formerly stackdriver) and blob storage handlers (e.g. S3, GCS, WASB).
For blob storage handlers, depending on the state of the task, logs could be in a lot of different places and in multiple different files. For this reason, we need to check all locations and interleave what we find. To do this we need to be able to parse the timestamp for each line. If you are using a custom formatter you may need to override the default parser by providing a callable name at Airflow setting [logging] interleave_timestamp_parser
.
For streaming handlers, no matter the task phase or location of execution, all log messages can be sent to the logging service with the same identifier so generally speaking there isn’t a need to check multiple sources and interleave.
Troubleshooting¶
If you want to check which task handler is currently set, you can use the airflow info
command as in
the example below.
$ airflow info
...
airflow on PATH: [True]
Executor: [SequentialExecutor]
Task Logging Handlers: [StackdriverTaskHandler]
SQL Alchemy Conn: [sqlite://///root/airflow/airflow.db]
DAGs Folder: [/root/airflow/dags]
Plugins Folder: [/root/airflow/plugins]
Base Log Folder: [/root/airflow/logs]
You can also run airflow config list
to check that the logging configuration options have valid values.
Advanced configuration¶
Not all configuration options are available from the airflow.cfg
file. Some configuration options require
that the logging config class be overwritten. This can be done via the logging_config_class
option
in airflow.cfg
file. This option should specify the import path to a configuration compatible with
logging.config.dictConfig()
. If your file is a standard import location, then you should set a PYTHONPATH
environment variable.
Follow the steps below to enable custom logging config class:
Start by setting environment variable to known directory e.g.
~/airflow/
export PYTHONPATH=~/airflow/
Create a directory to store the config file e.g.
~/airflow/config
Create file called
~/airflow/config/log_config.py
with following the contents:from copy import deepcopy from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
At the end of the file, add code to modify the default dictionary configuration.
Update
$AIRFLOW_HOME/airflow.cfg
to contain:[logging] remote_logging = True logging_config_class = log_config.LOGGING_CONFIG
Restart the application.
See Modules Management for details on how Python and Airflow manage modules.
External Links¶
When using remote logging, you can configure Airflow to show a link to an external UI within the Airflow Web UI. Clicking the link redirects you to the external UI.
Some external systems require specific configuration in Airflow for redirection to work but others do not.
Serving logs from workers and triggerer¶
Most task handlers send logs upon completion of a task. In order to view logs in real time, Airflow starts an HTTP server to serve the logs in the following cases:
If
SequentialExecutor
orLocalExecutor
is used, then whenairflow scheduler
is running.If
CeleryExecutor
is used, then whenairflow worker
is running.
In triggerer, logs are served unless the service is started with option --skip-serve-logs
.
The server is running on the port specified by worker_log_server_port
option in [logging]
section, and option triggerer_log_server_port
for triggerer. Defaults are 8793 and 8794, respectively.
Communication between the webserver and the worker is signed with the key specified by secret_key
option in [webserver]
section. You must ensure that the key matches so that communication can take place without problems.
We are using Gunicorn as a WSGI server. Its configuration options can be overridden with the GUNICORN_CMD_ARGS
env variable. For details, see Gunicorn settings.
Implementing a custom file task handler¶
Note
This is an advanced topic and most users should be able to just use an existing handler from Writing logs.
In our providers we have a healthy variety of options with all the major cloud providers. But should you need to implement logging with a different service, and should you then decide to implement a custom FileTaskHandler, there are a few settings to be aware of, particularly in the context of trigger logging.
Triggers require a shift in the way that logging is set up. In contrast with tasks, many triggers run in the same process, and with triggers, since they run in asyncio, we have to be mindful of not introducing blocking calls through the logging handler. And because of the variation in handler behavior (some write to file, some upload to blob storage, some send messages over network as they arrive, some do so in thread), we need to have some way to let triggerer know how to use them.
To accomplish this we have a few attributes that may be set on the handler, either the instance or the class. Inheritance is not respected for these parameters, because subclasses of FileTaskHandler may differ from it in the relevant characteristics. These params are described below:
trigger_should_wrap
: Controls whether this handler should be wrapped by TriggerHandlerWrapper. This is necessary when each instance of handler creates a file handler that it writes all messages to.trigger_should_queue
: Controls whether the triggerer should put a QueueListener between the event loop and the handler, to ensure blocking IO in the handler does not disrupt the event loop.trigger_send_end_marker
: Controls whether an END signal should be sent to the logger when trigger completes. It is used to tell the wrapper to close and remove the individual file handler specific to the trigger that just completed.trigger_supported
: Iftrigger_should_wrap
andtrigger_should_queue
are not True, we generally assume that the handler does not support triggers. But if in this case the handler hastrigger_supported
set to True, then we’ll still move the handler to root at triggerer start so that it will process trigger messages. Essentially, this should be true for handlers that “natively” support triggers. One such example of this is the StackdriverTaskHandler.