Tasks¶
A Task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and then have upstream and downstream dependencies set between them in order to express the order they should run in.
There are three basic kinds of Task:
Operators, predefined task templates that you can string together quickly to build most parts of your DAGs.
Sensors, a special subclass of Operators which are entirely about waiting for an external event to happen.
A TaskFlow-decorated @task, which is a custom Python function packaged up as a Task.
Internally, these are all actually subclasses of Airflow’s BaseOperator, and the concepts of Task and Operator are somewhat interchangeable, but it’s useful to think of them as separate concepts - essentially, Operators and Sensors are templates, and when you call one in a DAG file, you’re making a Task.
Relationships¶
The key part of using Tasks is defining how they relate to each other - their dependencies, or as we say in Airflow, their upstream and downstream tasks. You declare your Tasks first, and then you declare their dependencies second.
Note
We call the upstream task the one that directly precedes another task; it used to be called the parent task. Be aware that this term does not cover tasks that sit higher in the task hierarchy but are not direct parents of the task. The same definition applies to a downstream task, which must be a direct child of the other task.
There are two ways of declaring dependencies - using the >> and << (bitshift) operators:
first_task >> second_task >> [third_task, fourth_task]
Or the more explicit set_upstream and set_downstream methods:
first_task.set_downstream(second_task)
third_task.set_upstream(second_task)
These both do exactly the same thing, but in general we recommend you use the bitshift operators, as they are easier to read in most cases.
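To make the equivalence concrete, here is a minimal, stdlib-only sketch of how the bitshift operators can be mapped onto set_downstream and set_upstream. ToyTask is a hypothetical stand-in written for illustration; it is not Airflow's BaseOperator and only tracks task ids.

```python
class ToyTask:
    """Hypothetical stand-in for a task; it only records dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, others):
        # Accept either a single task or a list of tasks.
        for other in others if isinstance(others, list) else [others]:
            self.downstream.add(other.task_id)
            other.upstream.add(self.task_id)

    def set_upstream(self, others):
        for other in others if isinstance(others, list) else [others]:
            other.set_downstream(self)

    def __rshift__(self, others):
        # a >> b is shorthand for a.set_downstream(b)
        self.set_downstream(others)
        return others  # returning the right-hand side allows chaining

    def __lshift__(self, others):
        # a << b is shorthand for a.set_upstream(b)
        self.set_upstream(others)
        return others


# Bitshift style
first, second, third, fourth = (ToyTask(n) for n in ("first", "second", "third", "fourth"))
first >> second >> [third, fourth]

# Method style, producing the same graph
a, b, c, d = (ToyTask(n) for n in ("first", "second", "third", "fourth"))
a.set_downstream(b)
c.set_upstream(b)
d.set_upstream(b)
```

Both styles leave "second" with the downstream set {"third", "fourth"}, which is why choosing between them is purely a matter of readability.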
By default, a Task will run when all of its upstream (parent) tasks have succeeded, but there are many ways of modifying this behaviour to add branching, to only wait for some upstream tasks, or to change behaviour based on where the current run is in history. For more, see Control Flow.
Tasks don’t pass information to each other by default, and run entirely independently. If you want to pass information from one Task to another, you should use XComs.
Task Instances¶
Much in the same way that a DAG is instantiated into a DAG Run each time it runs, the tasks under a DAG are instantiated into Task Instances.
A Task Instance is a specific run of a task for a given DAG (and thus for a given data interval). Task Instances are also the representation of a Task that has state, representing what stage of the lifecycle it is in.
The possible states for a Task Instance are:
none: The Task has not yet been queued for execution (its dependencies are not yet met)
scheduled: The scheduler has determined the Task’s dependencies are met and it should run
queued: The task has been assigned to an Executor and is awaiting a worker
running: The task is running on a worker (or on a local/synchronous executor)
success: The task finished running without errors
restarting: The task was externally requested to restart when it was running
failed: The task had an error during execution and failed to run
skipped: The task was skipped due to branching, LatestOnly, or similar.
upstream_failed: An upstream task failed and the Trigger Rule says we needed it
up_for_retry: The task failed, but has retry attempts left and will be rescheduled.
up_for_reschedule: The task is a Sensor that is in reschedule mode
deferred: The task has been deferred to a trigger
removed: The task has vanished from the DAG since the run started
Ideally, a task should flow from none, to scheduled, to queued, to running, and finally to success.
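As a small illustration of that ideal flow, the sketch below checks whether an observed history of states is still on the happy path. The helper name and the idea of a "history" list are assumptions made for this example; Airflow does not expose such a function.

```python
# The ideal flow described above, using the state names from the list.
HAPPY_PATH = ["none", "scheduled", "queued", "running", "success"]


def is_on_happy_path(history):
    """Return True if the observed states match the ideal flow so far."""
    return history == HAPPY_PATH[: len(history)]


on_track = is_on_happy_path(["none", "scheduled", "queued"])
retried = is_on_happy_path(["none", "scheduled", "queued", "running", "up_for_retry"])
```

Here on_track is True because the task has only progressed part of the way, while retried is False because up_for_retry is a detour from the ideal flow.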
When any custom Task (Operator) is running, it will get a copy of the task instance passed to it; as well as being able to inspect task metadata, it also contains methods for things like XComs.
Relationship Terminology¶
For any given Task Instance, there are two types of relationships it has with other instances.
Firstly, it can have upstream and downstream tasks:
task1 >> task2 >> task3
When a DAG runs, it will create instances for each of these tasks that are upstream/downstream of each other, but which all have the same data interval.
There may also be instances of the same task, but for different data intervals - from other runs of the same DAG. We call these previous and next - it is a different relationship to upstream and downstream!
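The two kinds of relationship can be pictured with a toy model in which an instance is identified by its task id plus the start of its data interval. Everything here (ToyInstance, UPSTREAM_OF, and the helper functions) is hypothetical and exists only to show the distinction: upstream changes the task and keeps the interval, while previous keeps the task and changes the interval.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ToyInstance:
    task_id: str
    interval_start: date


# Mirrors: task1 >> task2 >> task3
UPSTREAM_OF = {"task2": "task1", "task3": "task2"}


def upstream_instance(ti):
    """Same data interval, different task."""
    parent = UPSTREAM_OF.get(ti.task_id)
    return ToyInstance(parent, ti.interval_start) if parent else None


def previous_instance(ti, intervals):
    """Same task, earlier data interval (from another run of the same DAG)."""
    idx = intervals.index(ti.interval_start)
    return ToyInstance(ti.task_id, intervals[idx - 1]) if idx > 0 else None


intervals = [date(2024, 1, 1), date(2024, 1, 2)]
current = ToyInstance("task2", date(2024, 1, 2))
up = upstream_instance(current)
prev = previous_instance(current, intervals)
```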
Note
Some older Airflow documentation may still use “previous” to mean “upstream”. If you find an occurrence of this, please help us fix it!
Timeouts¶
If you want a task to have a maximum runtime, set its execution_timeout attribute to a datetime.timedelta value that is the maximum permissible runtime. This applies to all Airflow tasks, including sensors. execution_timeout controls the maximum time allowed for every execution. If execution_timeout is breached, the task times out and AirflowTaskTimeout is raised.
In addition, sensors have a timeout parameter. This only matters for sensors in reschedule mode. timeout controls the maximum time allowed for the sensor to succeed. If timeout is breached, AirflowSensorTimeout will be raised and the sensor fails immediately without retrying.
The following SFTPSensor example illustrates this. The sensor is in reschedule mode, meaning it is periodically executed and rescheduled until it succeeds.
Each time the sensor pokes the SFTP server, it is allowed to take a maximum of 60 seconds, as defined by execution_timeout.
If it takes the sensor more than 60 seconds to poke the SFTP server, AirflowTaskTimeout will be raised. The sensor is allowed to retry when this happens. It can retry up to 2 times, as defined by retries.
From the start of the first execution until it eventually succeeds (i.e. after the file ‘/root/test’ appears), the sensor is allowed a maximum of 3600 seconds, as defined by timeout. In other words, if the file does not appear on the SFTP server within 3600 seconds, the sensor will raise AirflowSensorTimeout. It will not retry when this error is raised.
If the sensor fails due to other reasons, such as network outages during the 3600-second interval, it can retry up to 2 times, as defined by retries. Retrying does not reset the timeout. It will still have up to 3600 seconds in total for it to succeed.
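from datetime import timedelta

from airflow.providers.sftp.sensors.sftp import SFTPSensor

sensor = SFTPSensor(
    task_id="sensor",
    path="/root/test",
    execution_timeout=timedelta(seconds=60),  # limit for each individual poke
    timeout=3600,  # total seconds allowed from the first execution
    retries=2,
    mode="reschedule",
)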
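The interplay between execution_timeout, timeout, and retries can be easier to follow as a simulation. The sketch below is a simplified, stdlib-only model of the rules described above, not Airflow's actual sensor code; Poke and simulate_sensor are hypothetical names, and the model assumes that each failed attempt consumes one retry.

```python
from dataclasses import dataclass


@dataclass
class Poke:
    start: float     # seconds since the sensor's first execution began
    duration: float  # how long this poke took, in seconds
    result: str      # "found", "not_found", or "error"


def simulate_sensor(pokes, execution_timeout=60.0, timeout=3600.0, retries=2):
    """Return the sensor's outcome after the given sequence of pokes."""
    failures = 0
    for poke in pokes:
        if poke.duration > execution_timeout:
            error = "AirflowTaskTimeout"  # a single poke ran too long
        elif poke.result == "error":
            error = "other error"  # e.g. a network outage
        else:
            error = None

        if error is not None:
            failures += 1
            if failures > retries:
                return f"failed: {error}"
            continue  # retry; the overall timeout clock is NOT reset

        if poke.result == "found":
            return "success"
        if poke.start + poke.duration > timeout:
            # Overall budget exhausted: fail immediately, no retry.
            return "failed: AirflowSensorTimeout"
    return "up_for_reschedule"


file_appears = simulate_sensor([Poke(0, 5, "not_found"), Poke(300, 5, "found")])
never_appears = simulate_sensor([Poke(0, 5, "not_found"), Poke(3600, 5, "not_found")])
slow_pokes = simulate_sensor([Poke(0, 61, "not_found"), Poke(100, 61, "not_found"), Poke(200, 61, "not_found")])
```

In this model, slow pokes are retried until the retry budget runs out, whereas exceeding the overall timeout ends the sensor at once regardless of how many retries remain.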
If you merely want to be notified if a task runs over but still let it run to completion, you want SLAs instead.
SLAs¶
An SLA, or Service Level Agreement, is an expectation for the maximum time by which a Task should be completed, relative to the DAG Run start time. If a task takes longer than this to run, it becomes visible in the “SLA Misses” part of the user interface, and it is included in an email listing all tasks that missed their SLA.
Tasks over their SLA are not cancelled, though - they are allowed to run to completion. If you want to cancel a task after a certain runtime is reached, you want Timeouts instead.
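The idea that an SLA is measured from the DAG Run start, rather than from the moment the task itself starts, can be sketched as follows. This is a simplified illustration based on the description above; sla_missed is a hypothetical helper, not how Airflow's scheduler computes SLA misses.

```python
from datetime import datetime, timedelta


def sla_missed(run_start, sla, finished_at, now):
    """Return True if the task is past its SLA deadline.

    finished_at is None while the task is still running.
    """
    deadline = run_start + sla
    if finished_at is not None:
        return finished_at > deadline
    return now > deadline


run_start = datetime(2024, 1, 1, 0, 0)
sla = timedelta(minutes=30)

finished_late = sla_missed(run_start, sla, datetime(2024, 1, 1, 0, 45), datetime(2024, 1, 1, 1, 0))
finished_on_time = sla_missed(run_start, sla, datetime(2024, 1, 1, 0, 20), datetime(2024, 1, 1, 1, 0))
still_running = sla_missed(run_start, sla, None, datetime(2024, 1, 1, 0, 31))
```

Note that still_running is True even though the task has not finished: a miss is recorded as soon as the deadline passes, and the task is left to run.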
To set an SLA for a task, pass a datetime.timedelta object to the Task/Operator’s sla parameter. You can also supply an sla_miss_callback that will be called when the SLA is missed if you want to run your own logic.
If you want to disable SLA checking entirely, you can set check_slas = False in Airflow’s [core] configuration.
To read more about configuring the emails, see Email Configuration.
Note
Manually-triggered tasks and tasks in event-driven DAGs will not be checked for an SLA miss. For more information on DAG schedule values see DAG Run.
sla_miss_callback¶
You can also supply an sla_miss_callback that will be called when the SLA is missed if you want to run your own logic.
The function signature of an sla_miss_callback requires 5 parameters.
dag
Parent DAG object for the DAG Run in which tasks missed their SLA.
task_list
String list (new-line separated, \n) of all tasks that missed their SLA since the last time that the sla_miss_callback ran.
blocking_task_list
Any task in the DAGRun(s) (with the same execution_date as a task that missed SLA) that is not in a SUCCESS state at the time that the sla_miss_callback runs (for example, ‘running’ or ‘failed’). These tasks are described as tasks that are blocking themselves or another task from completing before the SLA window is complete.
slas
List of SlaMiss objects associated with the tasks in the task_list parameter.
blocking_tis
List of the TaskInstance objects that are associated with the tasks in the blocking_task_list parameter.
Examples of sla_miss_callback function signature:
def my_sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    ...

def my_sla_miss_callback(*args):
    ...
Example DAG:
import datetime
import time

import pendulum

from airflow.decorators import dag, task


def sla_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(
        "The callback arguments are: ",
        {
            "dag": dag,
            "task_list": task_list,
            "blocking_task_list": blocking_task_list,
            "slas": slas,
            "blocking_tis": blocking_tis,
        },
    )


@dag(
    schedule="*/2 * * * *",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    sla_miss_callback=sla_callback,
    default_args={"email": "email@example.com"},
)
def example_sla_dag():
    @task(sla=datetime.timedelta(seconds=10))
    def sleep_20():
        """Sleep for 20 seconds"""
        time.sleep(20)

    @task
    def sleep_30():
        """Sleep for 30 seconds"""
        time.sleep(30)

    sleep_20() >> sleep_30()


example_dag = example_sla_dag()
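A callback will usually do more than print its arguments. The sketch below shows one possible shape for a callback that turns the five parameters into a short summary, exercised here with stand-in values so that it runs without Airflow. The function name and message format are made up for illustration, and it assumes blocking_task_list is a new-line separated string like task_list.

```python
from types import SimpleNamespace


def summarize_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Build a one-line summary from the sla_miss_callback arguments."""
    # task_list is a new-line separated string, so split it into entries.
    missed = [line for line in task_list.splitlines() if line.strip()]
    # Assumption: blocking_task_list uses the same new-line separated format.
    blocking = [line for line in blocking_task_list.splitlines() if line.strip()]
    dag_id = getattr(dag, "dag_id", str(dag))
    return f"DAG {dag_id}: {len(missed)} task(s) missed SLA; {len(blocking)} blocking"


# Stand-in values; in a real run Airflow supplies these arguments.
fake_dag = SimpleNamespace(dag_id="example_sla_dag")
summary = summarize_sla_miss(
    fake_dag,
    task_list="sleep_20 on 2021-01-01\nsleep_30 on 2021-01-01",
    blocking_task_list="sleep_20 on 2021-01-01",
    slas=[],
    blocking_tis=[],
)
```

A function of this shape could be passed as sla_miss_callback in place of sla_callback, with the summary forwarded to whatever alerting channel you use.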
Special Exceptions¶
If you want to control your task’s state from within custom Task/Operator code, Airflow provides two special exceptions you can raise:
AirflowSkipException will mark the current task as skipped
AirflowFailException will mark the current task as failed, ignoring any remaining retry attempts
These can be useful if your code has extra knowledge about its environment and wants to fail/skip faster - e.g., skipping when it knows there’s no data available, or fast-failing when it detects its API key is invalid (as that will not be fixed by a retry).
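The effect of these exceptions on the final state can be sketched with a toy model. ToySkip and ToyFail are hypothetical stand-ins for AirflowSkipException and AirflowFailException, and resolve_state is a simplified illustration of the behaviour described above rather than Airflow's own logic.

```python
class ToySkip(Exception):
    """Stand-in for AirflowSkipException."""


class ToyFail(Exception):
    """Stand-in for AirflowFailException."""


def resolve_state(task_fn, retries_left):
    """Run task_fn and return the state the task would end up in."""
    try:
        task_fn()
    except ToySkip:
        return "skipped"
    except ToyFail:
        return "failed"  # remaining retries are ignored
    except Exception:
        # An ordinary error is retried while attempts remain.
        return "up_for_retry" if retries_left > 0 else "failed"
    return "success"


def no_data():
    raise ToySkip("no data available")


def bad_api_key():
    raise ToyFail("API key is invalid")


def flaky():
    raise RuntimeError("transient network error")


skip_state = resolve_state(no_data, retries_left=3)
fail_fast_state = resolve_state(bad_api_key, retries_left=3)
retry_state = resolve_state(flaky, retries_left=3)
```

With three retries left, the ordinary error yields up_for_retry, while the fail-fast exception goes straight to failed.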
Zombie/Undead Tasks¶
No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:
Zombie tasks are TaskInstances stuck in a running state despite their associated jobs being inactive (e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for many reasons, including:
The Airflow worker ran out of memory and was OOMKilled.
The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.
Undead tasks are tasks that are not supposed to be running but are, often caused when you manually edit Task Instances via the UI. Airflow will find them periodically and terminate them.
Below is the code snippet from the Airflow scheduler that runs periodically to detect zombie/undead tasks.
def _find_zombies(self) -> None:
    """
    Find zombie task instances and create a TaskCallbackRequest to be handled by the DAG processor.

    Zombie instances are tasks haven't heartbeated for too long or have a no-longer-running LocalTaskJob.
    """
    from airflow.jobs.job import Job

    self.log.debug("Finding 'running' jobs without a recent heartbeat")
    limit_dttm = timezone.utcnow() - timedelta(seconds=self._zombie_threshold_secs)

    with create_session() as session:
        zombies: list[tuple[TI, str, str]] = (
            session.execute(
                select(TI, DM.fileloc, DM.processor_subdir)
                .with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql")
                .join(Job, TI.job_id == Job.id)
                .join(DM, TI.dag_id == DM.dag_id)
                .where(TI.state == TaskInstanceState.RUNNING)
                .where(
                    or_(
                        Job.state != JobState.RUNNING,
                        Job.latest_heartbeat < limit_dttm,
                    )
                )
                .where(Job.job_type == "LocalTaskJob")
                .where(TI.queued_by_job_id == self.job.id)
            )
            .unique()
            .all()
        )

    if zombies:
        self.log.warning("Failing (%s) jobs without heartbeat after %s", len(zombies), limit_dttm)

    with create_session() as session:
        for ti, file_loc, processor_subdir in zombies:
            zombie_message_details = self._generate_zombie_message_details(ti)
            request = TaskCallbackRequest(
                full_filepath=file_loc,
                processor_subdir=processor_subdir,
                simple_task_instance=SimpleTaskInstance.from_ti(ti),
                msg=str(zombie_message_details),
            )
            session.add(
                Log(
                    event="heartbeat timeout",
                    task_instance=ti.key,
                    extra=(
                        f"Task did not emit heartbeat within time limit ({self._zombie_threshold_secs} "
                        "seconds) and will be terminated. "
                        "See https://airflow.apache.org/docs/apache-airflow/"
                        "stable/core-concepts/tasks.html#zombie-undead-tasks"
                    ),
                )
            )
            self.log.error(
                "Detected zombie job: %s "
                "(See https://airflow.apache.org/docs/apache-airflow/"
                "stable/core-concepts/tasks.html#zombie-undead-tasks)",
                request,
            )
            self.job.executor.send_callback(request)
            Stats.incr("zombies_killed", tags={"dag_id": ti.dag_id, "task_id": ti.task_id})
The snippet above uses the following criteria to detect zombie tasks:
Task Instance State
Only task instances in the RUNNING state are considered potential zombies.
Job State and Heartbeat Check
Zombie tasks are identified if the associated job is not in the RUNNING state or if the latest heartbeat of the job is earlier than the calculated time threshold (limit_dttm). The heartbeat is a mechanism to indicate that a task or job is still alive and running.
Job Type
The job associated with the task must be of type LocalTaskJob.
Queued by Job ID
Only tasks queued by the scheduler job that is running this check are considered.
These conditions collectively help identify running tasks that may be zombies based on their state, associated job state, heartbeat status, job type, and the specific job that queued them. If a task meets these criteria, it is considered a potential zombie, and further actions, such as logging and sending a callback request, are taken.
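The same four criteria can be restated as a plain Python predicate, which may be easier to read than the SQLAlchemy query. ToyJob, ToyTI, and is_zombie are hypothetical names for this sketch; the real check runs in the database as shown in the scheduler snippet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ToyJob:
    state: str
    job_type: str
    latest_heartbeat: datetime


@dataclass
class ToyTI:
    state: str
    job: ToyJob
    queued_by_job_id: int


def is_zombie(ti, scheduler_job_id, now, threshold_secs):
    """Mirror the WHERE clauses of the zombie-detection query."""
    limit_dttm = now - timedelta(seconds=threshold_secs)
    return (
        ti.state == "running"  # task instance state
        and (ti.job.state != "running" or ti.job.latest_heartbeat < limit_dttm)  # job state / heartbeat
        and ti.job.job_type == "LocalTaskJob"  # job type
        and ti.queued_by_job_id == scheduler_job_id  # queued by this scheduler job
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = ToyTI("running", ToyJob("running", "LocalTaskJob", now - timedelta(minutes=10)), queued_by_job_id=1)
fresh = ToyTI("running", ToyJob("running", "LocalTaskJob", now - timedelta(seconds=10)), queued_by_job_id=1)

stale_is_zombie = is_zombie(stale, scheduler_job_id=1, now=now, threshold_secs=300)
fresh_is_zombie = is_zombie(fresh, scheduler_job_id=1, now=now, threshold_secs=300)
```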
Reproducing zombie tasks locally¶
If you’d like to reproduce zombie tasks for development/testing processes, follow the steps below:
Set the environment variables below for your local Airflow setup (alternatively, you could tweak the corresponding config values in airflow.cfg):
export AIRFLOW__SCHEDULER__LOCAL_TASK_JOB_HEARTBEAT_SEC=600
export AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD=2
export AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL=5
Have a DAG with a task that takes about 10 minutes to complete (i.e. a long-running task). For example, you could use the DAG below:
from airflow.decorators import dag
from airflow.operators.bash import BashOperator
from datetime import datetime


@dag(start_date=datetime(2021, 1, 1), schedule="@once", catchup=False)
def sleep_dag():
    t1 = BashOperator(
        task_id="sleep_10_minutes",
        bash_command="sleep 600",
    )


sleep_dag()
Run the above DAG and wait for a while. You should see the task instance becoming a zombie task and then being killed by the scheduler.
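The environment variables in the first step follow Airflow's AIRFLOW__{SECTION}__{KEY} naming convention, which is how they map to entries in airflow.cfg. The sketch below decodes the three variables and shows why they produce a zombie: the task heartbeats far less often than the threshold allows. parse_airflow_env_var is a hypothetical helper written for this example, not an Airflow function.

```python
def parse_airflow_env_var(name):
    """Split AIRFLOW__{SECTION}__{KEY} into (section, key), both lower-cased."""
    prefix = "AIRFLOW__"
    if not name.startswith(prefix):
        raise ValueError(f"not an Airflow config variable: {name}")
    section, sep, key = name[len(prefix):].partition("__")
    if not sep or not section or not key:
        raise ValueError(f"expected AIRFLOW__SECTION__KEY, got: {name}")
    return section.lower(), key.lower()


env = {
    "AIRFLOW__SCHEDULER__LOCAL_TASK_JOB_HEARTBEAT_SEC": "600",
    "AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD": "2",
    "AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL": "5",
}
config = {parse_airflow_env_var(name): int(value) for name, value in env.items()}

heartbeat_every = config[("scheduler", "local_task_job_heartbeat_sec")]
zombie_threshold = config[("scheduler", "scheduler_zombie_task_threshold")]

# A heartbeat every 600 s cannot satisfy a 2 s threshold, so the running
# task looks dead to the scheduler the next time detection runs.
heartbeat_too_slow = heartbeat_every > zombie_threshold
```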
Executor Configuration¶
Some Executors allow optional per-task configuration - such as the KubernetesExecutor, which lets you set an image to run the task on.
This is achieved via the executor_config argument to a Task or Operator. Here’s an example of setting the Docker image for a task that will run on the KubernetesExecutor:
MyOperator(
    ...,
    executor_config={"KubernetesExecutor": {"image": "myCustomDockerImage"}},
)
The settings you can pass into executor_config vary by executor, so read the individual executor documentation in order to see what you can set.