Checking Airflow Health Status
Airflow has two methods to check the health of components - HTTP checks and CLI checks. All available checks are accessible through the CLI, but only some are accessible through HTTP due to the role of the component being checked and the tools being used to monitor the deployment.
For example, when running on Kubernetes, use a liveness probe (the livenessProbe property) with CLI checks on the scheduler deployment to restart it when it fails. For the webserver, you can configure the readiness probe (the readinessProbe property) using the Webserver Health Check Endpoint. For an example in a Docker Compose environment, see the docker-compose.yaml file available in Running Airflow in Docker.
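One way to wire such probes is to have the scheduler probe exec a CLI check and the webserver probe query the health endpoint. A minimal sketch of the underlying commands, assuming the webserver listens on its default port 8080:
# command a scheduler liveness probe might exec (see CLI Check for Scheduler below)
airflow jobs check --job-type SchedulerJob --local
# command a webserver readiness probe might run (see Webserver Health Check Endpoint below)
curl --fail http://localhost:8080/health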
Webserver Health Check Endpoint
To check the health status of your Airflow instance, you can access the /health endpoint. It returns a JSON object that provides a high-level overview.
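For example, assuming the webserver is reachable on its default port 8080, you can fetch it with:
curl -s http://localhost:8080/health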
{
  "metadatabase": {
    "status": "healthy"
  },
  "scheduler": {
    "status": "healthy",
    "latest_scheduler_heartbeat": "2018-12-26 17:15:11+00:00"
  },
  "triggerer": {
    "status": "healthy",
    "latest_triggerer_heartbeat": "2018-12-26 17:16:12+00:00"
  },
  "dag_processor": {
    "status": "healthy",
    "latest_dag_processor_heartbeat": "2018-12-26 17:16:12+00:00"
  }
}
- The status of each component can be either “healthy” or “unhealthy”.
- The status of metadatabase depends on whether a valid connection can be initiated with the database.
- The status of scheduler depends on when the latest scheduler heartbeat was received:
  - If the last heartbeat was received more than 30 seconds (the default value) before the current time, the scheduler is considered unhealthy.
  - This threshold value can be specified using the scheduler_health_check_threshold option within the [scheduler] section in airflow.cfg (see the configuration example after this list).
  - If you run more than one scheduler, only the state of one scheduler will be reported, i.e. one working scheduler is enough for the scheduler state to be considered healthy.
- The status of the triggerer behaves exactly like that of the scheduler as described above. Note that the status and latest_triggerer_heartbeat fields in the health check response will be null for deployments that do not include a triggerer component.
- The status of the dag_processor behaves exactly like that of the scheduler as described above. Note that the status and latest_dag_processor_heartbeat fields in the health check response will be null for deployments that do not include a dag_processor component.
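For example, the threshold can be raised either in airflow.cfg or through the equivalent environment variable; a sketch using the environment-variable form (the value of 60 seconds is only illustrative):
# equivalent to setting scheduler_health_check_threshold = 60 in the [scheduler] section
export AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD=60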
Please keep in mind that the HTTP response code of the /health endpoint should not be used to determine the health status of the application. The return code only indicates the state of the REST call itself (200 for success).
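A monitoring script should therefore inspect the JSON body rather than the response code; for example, assuming jq is available and the webserver is on its default port 8080, the following exits non-zero when the scheduler is not reported as healthy:
curl -s http://localhost:8080/health | jq -e '.scheduler.status == "healthy"'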
Served by the web server, this health check endpoint is independent of the newer Scheduler Health Check Server, which optionally runs on each scheduler.
Note
For this check to work, at least one working web server is required. If you use this check for scheduler monitoring, a failure of the web server means you lose the ability to monitor the scheduler, which may then be restarted even though it is in good condition. For greater confidence, consider using the CLI Check for Scheduler or the Scheduler Health Check Server.
Scheduler Health Check Server
To check scheduler health independently of the web server, Airflow can optionally start a small HTTP server in each scheduler that serves a /health endpoint. It returns status code 200 when the scheduler is healthy and status code 503 when it is unhealthy. To run this server in each scheduler, set [scheduler] enable_health_check to True; by default it is False. The server listens on the port specified by the [scheduler] scheduler_health_check_server_port option, which defaults to 8974. The server is implemented with Python's http.server.BaseHTTPRequestHandler.
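As a sketch, the server can be enabled through the corresponding environment variables (equivalent to setting the options in airflow.cfg) and then probed on the scheduler host:
# equivalent to [scheduler] enable_health_check and scheduler_health_check_server_port in airflow.cfg
export AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK=True
export AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_SERVER_PORT=8974
# on the scheduler host: prints 200 when healthy, 503 when unhealthy
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8974/health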
CLI Check for Scheduler
The scheduler creates an entry in the airflow.jobs.job.Job table with information about the host and a timestamp (heartbeat) at startup, and then updates it regularly. You can use this to check whether the scheduler is working correctly with the airflow jobs check command. On failure, the command will exit with a non-zero error code.
To check if the local scheduler is still working properly, run:
airflow jobs check --job-type SchedulerJob --local
To check if any scheduler is running when you are using high availability, run:
airflow jobs check --job-type SchedulerJob --allow-multiple --limit 100
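Since the command communicates its result through the exit code, it can also be used in a simple watchdog; a minimal sketch, where the restart command is a placeholder for however your scheduler is managed:
if ! airflow jobs check --job-type SchedulerJob --local; then
    echo "scheduler appears unhealthy" >&2
    systemctl restart airflow-scheduler   # placeholder: adapt to your deployment
fi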
CLI Check for Database
To verify that the database is working correctly, you can use the airflow db check
command. On failure, the command will exit
with a non-zero error code.
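To run the check against the database configured for your Airflow installation:
airflow db check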
HTTP monitoring for Celery Cluster
You can optionally use Flower to monitor the health of the Celery cluster. It also provides an HTTP API that you can use to build a health check for your environment.
For details about installation, see: Celery Executor. For details about usage, see: The Flower project documentation.
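For example, assuming Flower runs on its default port 5555, you can poll its HTTP API to list the workers it sees and their state (see the Flower API documentation for the available endpoints):
curl -s http://localhost:5555/api/workers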
CLI Check for Celery Workers
To verify that the Celery workers are working correctly, you can use the celery inspect ping
command. On failure, the command will exit
with a non-zero error code.
To check if the worker running on the local host is working correctly, run:
celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d celery@${HOSTNAME}
To check that all workers in the cluster are working correctly, run:
celery --app airflow.providers.celery.executors.celery_executor.app inspect ping
For more information, see: Management Command-line Utilities (inspect/control) and Workers Guide in the Celery documentation.