Building the image

Before diving into how the Airflow image is built, let us first explain why you might need to build a custom container image and show a few typical ways you can do it.

Quick start scenarios of image extending

The most common scenarios where you want to build your own image are adding a new apt package, adding a new PyPI dependency, and embedding DAGs into the image. Example Dockerfiles for those scenarios are below. If your goal is to quickly extend the Airflow image with a new provider, package, etc., this quick start is for you. Read further for more complex cases, which might involve either extending or customizing the image.

Adding new apt package

The following example adds vim to the Airflow image. When adding packages via apt you should switch to the root user when running the apt commands, but do not forget to switch back to the airflow user after installation is complete.

docs/docker-stack/docker-examples/extending/add-apt-packages/Dockerfile

FROM apache/airflow:2.3.2
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         vim \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow

Adding a new PyPI package

The following example adds the lxml Python package from PyPI to the image. When adding packages via pip you need to use the airflow user rather than root. Attempts to install pip packages as root will fail with an appropriate error message.

docs/docker-stack/docker-examples/extending/add-pypi-packages/Dockerfile

FROM apache/airflow:2.3.2
RUN pip install --no-cache-dir lxml

Embedding DAGs

The following example adds test_dag.py to your image in the /opt/airflow/dags folder.

docs/docker-stack/docker-examples/extending/embedding-dags/Dockerfile

FROM apache/airflow:2.3.2

COPY --chown=airflow:root test_dag.py /opt/airflow/dags

docs/docker-stack/docker-examples/extending/embedding-dags/test_dag.py

"""This dag only runs some simple tasks to test Airflow's task execution."""
import datetime

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

now = pendulum.now(tz="UTC")
now_to_the_hour = (now - datetime.timedelta(0, 0, 0, 0, 0, 3)).replace(minute=0, second=0, microsecond=0)
START_DATE = now_to_the_hour
DAG_NAME = 'test_dag_v1'

dag = DAG(
    DAG_NAME,
    schedule_interval='*/10 * * * *',
    default_args={'depends_on_past': True},
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
)

run_this_1 = EmptyOperator(task_id='run_this_1', dag=dag)
run_this_2 = EmptyOperator(task_id='run_this_2', dag=dag)
run_this_2.set_upstream(run_this_1)
run_this_3 = EmptyOperator(task_id='run_this_3', dag=dag)
run_this_3.set_upstream(run_this_2)

Extending vs. customizing the image

You might want to know very quickly whether you need to extend or customize the existing image for Apache Airflow. This chapter gives you a short answer to those questions.

Here is the comparison of the two approaches:

                                                  Extending   Customizing
Uses familiar ‘FROM’ pattern of image building    Yes         No
Requires only basic knowledge about images        Yes         No
Builds quickly                                    Yes         No
Produces image heavily optimized for size         No          Yes
Can build from custom airflow sources (forks)     No          Yes
Can build on air-gapped system                    No          Yes

TL;DR: If you need to build a custom image, it is easier to start with “Extending”. However, if your dependencies require compilation steps, or if you need to build the image from security-vetted packages, switching to “Customizing” produces much more optimized images. For example, equivalent images built by “Extending” and “Customizing” end up being 1.1 GB and 874 MB respectively, a 20% improvement in size for the customized image.

Note

You can also combine both approaches, customizing and extending the image. You can build an optimized base image first using the customization method (for example by your admin team) with all the dependencies that require heavy compilation, publish it in your registry, and let others extend your image using FROM and add their own lightweight dependencies. This reflects well the split where “casual” users typically extend the image and “power users” customize it.
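As a minimal sketch of the combined workflow, the script below writes a Dockerfile that extends an already customized base image. The registry name registry.example.com/data-platform/airflow-base:1.0.0 and the my-team-airflow tag are hypothetical placeholders, not names defined by Airflow.

```shell
# BASE_IMAGE is a hypothetical name: replace it with the image your admin
# team published after building it with the "customizing" method.
BASE_IMAGE="registry.example.com/data-platform/airflow-base:1.0.0"
BUILD_DIR="$(mktemp -d)"

# Write a Dockerfile that extends the customized base image.
cat > "${BUILD_DIR}/Dockerfile" <<EOF
FROM ${BASE_IMAGE}
# Lightweight dependency added by the team that extends the image
RUN pip install --no-cache-dir lxml
EOF

cat "${BUILD_DIR}/Dockerfile"

# Build it the usual way (requires Docker):
#   docker build "${BUILD_DIR}" --pull --tag my-team-airflow:0.0.1
```

The heavy, compiled dependencies stay in the base image, so the extending Dockerfile remains short and quick to build.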

Airflow Summit 2020’s Production Docker Image talk provides more details about the context, architecture and customization/extension methods for the Production Image.

Why customize the image?

The Apache Airflow community releases Docker images which are reference images for Apache Airflow. However, Airflow has more than 60 community-managed providers (installable via extras), and some of the default extras/providers installed are not used by everyone. Sometimes other extras/providers are needed, and sometimes (very often, actually) you need to add your own custom dependencies, packages, or even custom providers.

In Kubernetes and Docker terms this means that you need another image with your specific requirements. This is why you should learn how to build your own Docker (or more properly Container) image. You might be tempted to use the reference image and dynamically install the new packages while starting your containers, but this is a bad idea for multiple reasons, starting from the fragility of the build and ending with the extra time needed to install those packages, which has to happen every time a container starts. The only viable way to deal with new dependencies and requirements in production is to build and use your own image. You should install dependencies dynamically only in “hobbyist” and “quick start” scenarios, when you want to iterate quickly to try things out, and later replace that with your own images.

Building images primer

Note

The Dockerfile does not strictly follow the SemVer approach of Apache Airflow when it comes to features and backwards compatibility. While Airflow code strictly follows it, the Dockerfile is really a way to conveniently package Airflow using a standard container approach, so occasionally there are some changes in the building process or in the entrypoint of the image that require slight adaptation. Details of the changes and the adaptation needed can be found in the Changelog.

There are a few typical scenarios that you will encounter, and this section gives a quick recipe for achieving your goal. You can read further to understand the details, but for the simple cases the examples here use typical tools.

In the simplest case, building your image consists of these steps:

  1. Create your own Dockerfile (name it Dockerfile) where you add:

  • information about what your image should be based on (for example FROM apache/airflow:2.3.2-python3.8)

  • additional steps that should be executed in your image (typically in the form of RUN <command>)

  2. Build your image. This can be done with the docker CLI tools, and the examples below assume docker is used. There are other tools, like kaniko or podman, that allow you to build the image, but docker is so far the most popular and developer-friendly tool out there. A typical way of building the image looks as follows (my-image:0.0.1 is the custom tag of your image, containing the version). If you use a registry from which the image will be pulled, the image is usually named in the form registry/image-name. The name of the image has to be configured for the deployment method you use. It can be set, for example, as the image name in the docker-compose file or in the Helm chart.

docker build . -f Dockerfile --pull --tag my-image:0.0.1

  3. [Optional] Test the image. Airflow contains a tool that allows you to test the image. This step, however, requires locally checked out or extracted Airflow sources. If you happen to have the sources, you can test the image by running this command (in the Airflow root folder). The output will tell you if the image is “good-to-go”.

./scripts/ci/tools/verify_docker_image.sh PROD my-image:0.0.1

  4. Once you build the image locally, you usually have several options to make it available for your deployment:

  • For a docker-compose deployment, if you have already built your image and want to continue building it manually when needed with docker build, you can edit the docker-compose.yaml and replace the “apache/airflow:<version>” image with the image you have just built, my-image:0.0.1. It will be used from your local Docker Engine cache. You can also simply set the AIRFLOW_IMAGE_NAME variable to point to your image, and docker-compose will use it automatically without you having to modify the file.

  • Also for a docker-compose deployment, you can delegate image building to docker-compose. To do that, open your docker-compose.yaml file and search for the phrase “In order to add custom dependencies”. Follow the instructions there: comment out the “image” line and uncomment the “build” line. This is a standard docker-compose feature and you can read about it in the Docker Compose build reference. Run docker-compose build to build the images. As in the previous case, the image is stored in the Docker Engine cache and Docker Compose will use it from there. Under the hood, the docker-compose build command uses the same docker build command that you can run manually.

  • For some development-targeted Kubernetes deployments, you can load the images directly into Kubernetes clusters. Clusters such as kind or minikube have a dedicated load method to load the images into the cluster.

  • Last but not least, you can push your image to a remote registry, which is the most common way of storing and exposing images and the most portable way of publishing them. Both Docker Compose and Kubernetes can make use of images exposed via registries.
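The options above can be sketched as the commands below. This is an illustration rather than a complete deployment script: the run helper and the registry.example.com/my-team location are hypothetical, and the helper only prints each command unless you set DRY_RUN=0, so you can read the output without Docker, kind or minikube installed.

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

IMAGE="my-image:0.0.1"
# Hypothetical registry location; replace it with your own registry/image-name.
REMOTE_IMAGE="registry.example.com/my-team/${IMAGE}"

# Option 1: let docker-compose pick up the local image via AIRFLOW_IMAGE_NAME.
export AIRFLOW_IMAGE_NAME="${IMAGE}"

# Option 2: load the image into a local development cluster.
run kind load docker-image "${IMAGE}"
run minikube image load "${IMAGE}"

# Option 3: tag the image for a remote registry and push it there.
run docker tag "${IMAGE}" "${REMOTE_IMAGE}"
run docker push "${REMOTE_IMAGE}"
```

Pick only the option that matches your deployment; the three are alternatives, not a sequence.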

Extending the image

Extending the image is easiest if you just need to add some dependencies that do not require compiling. The compilation framework of Linux (so-called build-essential) is pretty big, and for production images size is a really important factor to optimize for, so our Production Image does not contain build-essential. If you need a compiler like gcc or g++, or make/cmake etc., those are not found in the image and it is recommended that you follow the “customize” route instead.

Extending the image is something you are most likely familiar with: simply build a new image using the Dockerfile’s FROM directive and add whatever you need. You can then add your Debian dependencies with apt, your PyPI dependencies with pip install, or anything else you need.

Base images

There are two types of images you can extend your image from:

  1. Regular Airflow image that contains the most common extras and providers, and all supported backend database clients for AMD64 platform and Postgres for ARM64 platform.

  2. Slim Airflow image, which is a minimal image that contains all supported backend database clients for the AMD64 platform and Postgres for the ARM64 platform, but contains no extras or providers except the 4 default providers.

Note

Differences of slim image vs. regular image.

The slim image is small compared to the regular image (~500 MB vs ~1.1 GB), and you might need to add a lot more packages and providers in order to make it useful for your case (but if you use only a small subset of providers, it might be a good starting point for you).

The slim images might have dependencies in different versions than those used when providers are preinstalled, simply because core Airflow on its own might have fewer limits on the versions. When you install some providers, they might require downgrading some dependencies if the providers require different limits for the same dependencies.

Naming conventions for the images:

Image            Python             Standard image                     Slim image
Latest default   3.7                apache/airflow:latest              apache/airflow:slim-latest
Default          3.7                apache/airflow:X.Y.Z               apache/airflow:slim-X.Y.Z
Latest           3.7,3.8,3.9,3.10   apache/airflow:latest-pythonN.M    apache/airflow:slim-latest-pythonN.M
Specific         3.7,3.8,3.9,3.10   apache/airflow:X.Y.Z-pythonN.M     apache/airflow:slim-X.Y.Z-pythonN.M

  • The “latest” image is always the latest released stable version available.
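To make the naming scheme concrete, here is a small sketch that composes an image reference from its parts. The airflow_image function is a hypothetical helper written for this illustration; it is not part of Airflow or Docker.

```shell
# Compose an image reference following the naming conventions above.
# Arguments: <airflow version or "latest"> [python version] [slim|regular]
airflow_image() {
    version="$1"
    python_version="${2:-}"
    flavour="${3:-regular}"

    tag="${version}"
    # Slim images carry a "slim-" prefix in front of the version.
    if [ "${flavour}" = "slim" ]; then
        tag="slim-${tag}"
    fi
    # A specific Python version is appended as a "-pythonN.M" suffix.
    if [ -n "${python_version}" ]; then
        tag="${tag}-python${python_version}"
    fi
    echo "apache/airflow:${tag}"
}

airflow_image latest             # prints apache/airflow:latest
airflow_image 2.3.2 3.8          # prints apache/airflow:2.3.2-python3.8
airflow_image 2.3.2 3.9 slim     # prints apache/airflow:slim-2.3.2-python3.9
```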

Important notes for the base images

You should be aware of a few things:

  • The production image of Airflow uses the “airflow” user, so if you want to add some tools as the root user, you need to switch to it with the USER directive of the Dockerfile and switch back to the airflow user when you are done. You should also follow the best practices for Dockerfiles to make sure your image is lean and small.

  • The PyPI dependencies in Apache Airflow are installed in the user library of the “airflow” user, so pip packages are installed to the ~/.local folder as if the --user flag was specified when running pip. Note also that using --no-cache-dir is a good idea that can help to make your image smaller.

Note

Only as of the 2.0.1 image is the --user flag turned on by default, by setting the PIP_USER environment variable to true. This can be disabled by unsetting the variable or by setting it to false. In the 2.0.0 image you had to add the --user flag explicitly, as in pip install --user.

  • If your apt or PyPI dependencies require build-essential or other packages needed to compile your Python dependencies, then your best choice is to follow the “Customize the image” route, because you can build a highly optimized (for size) image this way. However, it requires you to use the Dockerfile that is released as part of Apache Airflow sources (also available at Dockerfile).

  • You can also embed your DAGs in the image by simply adding them with the COPY directive of the Dockerfile. The DAGs in the production image are in the /opt/airflow/dags folder.

  • You can build your image without any need for Airflow sources. It is enough to place the Dockerfile and any files it refers to (such as DAG files) in a separate directory and run the command docker build . --pull --tag my-image:my-tag (where my-image is the name you want to give the image and my-tag is the tag you want to tag it with).

  • If your way of extending the image requires creating writable directories, you MUST remember to add a umask 0002 step in your RUN command. This is necessary in order to accommodate our approach for running the image with an arbitrary user. Such a user will always run with GID=0; the entrypoint will prevent non-root GIDs. You can read more about it in the arbitrary docker user documentation for the entrypoint. The umask 0002 is set as default when you enter the image, so any directories you create by default at runtime will have GID=0 and will be group-writable.

Note

When you build an image for an Airflow version < 2.1 (for example 2.0.2 or 1.10.15), the image is built with pip 20.2.4, because pip 21+ is only supported for Airflow 2.1+.

Note

Only as of 2.0.2 is the default group of the airflow user root. Previously it was airflow, so if you are building your images based on an earlier image, you need to manually change the default group for the airflow user:

RUN usermod -g 0 airflow

Examples of image extending

Example of customizing Airflow Provider packages

The Airflow Providers are released independently of core Airflow, and sometimes you might want to upgrade only specific providers to fix some problems or to use features available in that provider version. Here is an example of how you can do it:

docs/docker-stack/docker-examples/extending/custom-providers/Dockerfile

FROM apache/airflow:2.3.2
RUN pip install --no-cache-dir apache-airflow-providers-docker==2.5.1

Example of adding Airflow Provider package and apt package

The following example adds the apache-spark Airflow provider, which requires both Java (installed via apt) and a Python package from PyPI.

docs/docker-stack/docker-examples/extending/add-providers/Dockerfile

FROM apache/airflow:2.3.2
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         openjdk-11-jre-headless \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
RUN pip install --no-cache-dir apache-airflow-providers-apache-spark==2.1.3

Example of adding apt package

The following example adds vim to the airflow image.

docs/docker-stack/docker-examples/extending/add-apt-packages/Dockerfile

FROM apache/airflow:2.3.2
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         vim \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow

Example of adding PyPI package

The following example adds the lxml Python package from PyPI to the image.

docs/docker-stack/docker-examples/extending/add-pypi-packages/Dockerfile

FROM apache/airflow:2.3.2
RUN pip install --no-cache-dir lxml

Example when writable directory is needed

The following example adds a new directory that is supposed to be writable for any arbitrary user running the container.

docs/docker-stack/docker-examples/extending/writable-directory/Dockerfile

FROM apache/airflow:2.3.2
RUN umask 0002; \
    mkdir -p ~/writeable-directory

Example when you add packages requiring compilation

The following example adds mpi4py package which requires both build-essential and mpi compiler.

docs/docker-stack/docker-examples/extending/add-build-essential-extend/Dockerfile

FROM apache/airflow:2.3.2
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         build-essential libopenmpi-dev \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow
RUN pip install --no-cache-dir mpi4py

The size of this image is ~1.1 GB when built. As you will see further, you can achieve a 20% reduction in the size of the image if you use “Customizing” rather than “Extending” the image.

Example when you want to embed DAGs

The following example adds test_dag.py to your image in the /opt/airflow/dags folder.

docs/docker-stack/docker-examples/extending/embedding-dags/Dockerfile

FROM apache/airflow:2.3.2

COPY --chown=airflow:root test_dag.py /opt/airflow/dags

docs/docker-stack/docker-examples/extending/embedding-dags/test_dag.py

"""This dag only runs some simple tasks to test Airflow's task execution."""
import datetime

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

now = pendulum.now(tz="UTC")
now_to_the_hour = (now - datetime.timedelta(0, 0, 0, 0, 0, 3)).replace(minute=0, second=0, microsecond=0)
START_DATE = now_to_the_hour
DAG_NAME = 'test_dag_v1'

dag = DAG(
    DAG_NAME,
    schedule_interval='*/10 * * * *',
    default_args={'depends_on_past': True},
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
)

run_this_1 = EmptyOperator(task_id='run_this_1', dag=dag)
run_this_2 = EmptyOperator(task_id='run_this_2', dag=dag)
run_this_2.set_upstream(run_this_1)
run_this_3 = EmptyOperator(task_id='run_this_3', dag=dag)
run_this_3.set_upstream(run_this_2)

Customizing the image

Warning

BREAKING CHANGE! As of Airflow 2.3.0 you need to use Buildkit to build a customized Airflow Docker image. We are using new features of Buildkit (and dockerfile:1.4 syntax) to make our image faster to build and “standalone”, i.e. not needing any extra files from Airflow in order to be built. As of Airflow 2.3.0, the Dockerfile that is released with Airflow does not need any extra folders or files and can be copied and used from any folder. Previously you needed to copy Airflow sources together with the Dockerfile, as some scripts were needed to make it work. You also need to use the DOCKER_CONTEXT_FILES build arg if you want to use your own custom files during the build (see Using docker context files for details).

Note

You can usually use the latest Dockerfile released by Airflow to build previous Airflow versions. Note, however, that there are slight changes in the Dockerfile and entrypoint scripts that can make it behave slightly differently, depending on which Dockerfile version you used. Details of what has changed in each of the released versions of the Docker image can be found in the Changelog.

Prerequisites for building a customized Docker image:

  • You need to enable Buildkit to build the image. This can be done by setting DOCKER_BUILDKIT=1 as an environment variable or by installing the buildx plugin and running the docker buildx build command.

  • You need to have a recent version of Docker installed to handle the 1.4 syntax of the Dockerfile. Docker version 20.10.7 and above is known to work.
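A quick way to check both prerequisites before a build is sketched below. The version_ge and check_docker_version functions are hypothetical helpers, and the comparison relies on sort -V, which is assumed to be available (it is part of GNU coreutils).

```shell
# Return success when version $1 is greater than or equal to version $2.
# Uses "sort -V" (version sort) to find the smaller of the two versions.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED_DOCKER_VERSION="20.10.7"

# $1 is the installed Docker version, for example the output of:
#   docker version --format '{{.Client.Version}}'
check_docker_version() {
    if version_ge "$1" "${REQUIRED_DOCKER_VERSION}"; then
        echo "Docker $1 is new enough"
    else
        echo "Docker $1 is older than ${REQUIRED_DOCKER_VERSION}"
        return 1
    fi
}

check_docker_version "20.10.17"

# Enable Buildkit for every docker build started from this shell.
export DOCKER_BUILDKIT=1
```

The sketch assumes plain dotted version numbers; version strings with vendor suffixes may sort differently.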

Before attempting to customize the image, you need to download the flexible and customizable Dockerfile. You can extract the officially released version of the Dockerfile from the released sources. You can also conveniently download the latest released version from GitHub. You can save it in any directory; there is no need for any other files to be present there. If you wish to use your own files (for example a custom pip configuration, your own requirements, or custom dependencies), you need to use the DOCKER_CONTEXT_FILES build arg and place the files in the directory pointed at by the arg (see Using docker context files for details).

Customizing the image is an optimized way of adding your own dependencies to the image. It is better suited to preparing highly optimized (for size) production images, especially when you have dependencies that need to be compiled before installing (such as mpi4py).

It also allows more sophisticated usages, needed by “Power-users” - for example using forked version of Airflow, or building the images from security-vetted sources.

The big advantage of this method is that it produces an optimized image even if you need some compile-time dependencies that are not needed in the final image.

The first disadvantage is that building the image takes longer and requires you to use the Dockerfile that is released as part of Apache Airflow sources. The image usually takes much longer to build than the equivalent “Extended” image because, instead of extending the layers that already come from the base image, it rebuilds the layers needed to add extra dependencies at early stages of image building.

The second disadvantage is that the pattern of building Docker images with --build-arg is less familiar to developers of such images. However, it is quite well known to “power users”. That is why the customizing flow is better suited to users who are more familiar with image building and have more custom requirements.

When customizing the image, you can choose among a number of options for how to install Airflow:

  • From the PyPI releases (default)

  • From the custom installation sources - using additional/replacing the original apt or PyPI repositories

  • From local sources. This is used mostly during development.

  • From tag or branch, or specific commit from a GitHub Airflow repository (or fork). This is particularly useful when you build image for a custom version of Airflow that you keep in your fork and you do not want to release the custom Airflow version to PyPI.

  • From locally stored binary packages for Airflow, Airflow Providers and other dependencies. This is particularly useful if you want to build Airflow in a highly secure environment where all such packages must be vetted by your security team and stored in your private artifact registry. This also allows you to build the Airflow image in an air-gapped environment.

  • Side note. Building Airflow in an air-gapped environment sounds pretty funny, doesn’t it?

You can also add a range of customizations while building the image:

  • base python image you use for Airflow

  • version of Airflow to install

  • extras to install for Airflow (or even removing some default extras)

  • additional apt/python dependencies to use while building Airflow (DEV dependencies)

  • add requirements.txt file to docker-context-files directory to add extra requirements

  • additional apt/python dependencies to install for runtime version of Airflow (RUNTIME dependencies)

  • additional commands and variables to set if needed during building or preparing Airflow runtime

  • choosing constraint file to use when installing Airflow

The last point needs additional explanation. Airflow uses constraints to make sure that it can be installed predictably, even if new versions of Airflow dependencies are released (or even dependencies of our dependencies!). The Docker image and accompanying scripts usually determine automatically the right version of constraints to use, based on the installed Airflow version and the Python version. For example, Airflow 2.0.2 installed from PyPI uses constraints from the constraints-2.0.2 tag. However, in some cases (when installing Airflow from GitHub, for example) you have to manually specify the version of constraints to use, otherwise it will default to the latest version of the constraints, which might not be compatible with the version of Airflow you use.

You can also download any version of the Airflow constraints, adapt it by manually setting your own versions of dependencies, and use the constraints file that you prepared.

You can read more about constraints in Installation from PyPI.
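As an illustration of how a constraints file is tied to an Airflow version and a Python version, the sketch below builds its download URL. The constraints_url function is a hypothetical helper; the URL pattern is the one used by the curl command in the restricted-environment example in this document.

```shell
# Build the URL of the constraints file for a given Airflow and Python version.
constraints_url() {
    airflow_version="$1"
    python_version="$2"
    echo "https://raw.githubusercontent.com/apache/airflow/constraints-${airflow_version}/constraints-${python_version}.txt"
}

constraints_url 2.0.2 3.7

# Typical use (requires network access):
#   pip install "apache-airflow==2.0.2" --constraint "$(constraints_url 2.0.2 3.7)"
```

You could download the file from that URL and edit it locally when you need to prepare your own adapted set of constraints.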

Note that if you place requirements.txt in the docker-context-files folder, it will be used to install all requirements declared there. It is recommended that the file pins the dependencies with the == version specifier, to achieve a stable set of requirements regardless of whether someone releases a newer version. However, you have to make sure to update those requirements and rebuild the images to account for the latest security fixes.
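One way to enforce the pinning recommendation is a small check run before the build. The sketch below is only an illustration: find_unpinned is a hypothetical helper, and the sample requirements.txt is written to a temporary directory purely for the demonstration.

```shell
# Print every requirement that is not pinned with "==" and return non-zero
# if any is found. Blank lines and comment lines are ignored.
find_unpinned() {
    unpinned="$(grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v '==' || true)"
    if [ -n "${unpinned}" ]; then
        echo "${unpinned}"
        return 1
    fi
}

REQ_DIR="$(mktemp -d)"
cat > "${REQ_DIR}/requirements.txt" <<EOF
# pinned, reproducible
beautifulsoup4==4.10.0
# not pinned: flagged by the check
lxml
EOF

find_unpinned "${REQ_DIR}/requirements.txt" || echo "Pin the packages listed above with =="
```

The check is deliberately simple and treats any line without == as unpinned, so requirements that use other specifiers are flagged as well.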

Choosing Debian version when customizing the image

The reference Airflow image currently uses the bullseye version of Debian (also known as Debian 11) as its base image. However, when you build a custom image, you can also use the buster version of the base image. Airflow supports both versions of Debian. You select the Debian version by choosing the right Python base image:

  • --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" uses the buster version of Debian (Debian 10)

  • --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-bullseye" uses the bullseye version of Debian (Debian 11)
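If you script your builds, the base image name can be composed from the Python version and the Debian codename. The python_base_image function below is a hypothetical helper written for this illustration; it only accepts the two Debian versions named above.

```shell
# Compose the value of the PYTHON_BASE_IMAGE build arg.
# Arguments: <python version> <debian codename: buster or bullseye>
python_base_image() {
    python_version="$1"
    debian_version="$2"
    case "${debian_version}" in
        buster|bullseye)
            echo "python:${python_version}-slim-${debian_version}"
            ;;
        *)
            echo "Unsupported Debian version: ${debian_version}" >&2
            return 1
            ;;
    esac
}

echo "--build-arg PYTHON_BASE_IMAGE=$(python_base_image 3.7 bullseye)"
```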

Using docker-context-files

When customizing the image, you can optionally make Airflow install custom binaries or provide custom configuration for your pip in docker-context-files. In order to enable it, you need to add --build-arg DOCKER_CONTEXT_FILES=docker-context-files build arg when you build the image. You can pass any subdirectory of your docker context, it will always be mapped to /docker-context-files during the build.

You can use docker-context-files for the following purposes:

  • you can place requirements.txt in the docker-context-files folder and list in it any pip packages you want to install. Those requirements will be automatically installed during the build.

docs/docker-stack/docker-examples/customizing/own-requirements.sh

mkdir -p docker-context-files

cat <<EOF >./docker-context-files/requirements.txt
beautifulsoup4==4.10.0
EOF

export DOCKER_BUILDKIT=1
docker build . \
    --build-arg DOCKER_CONTEXT_FILES=./docker-context-files \
    --tag "my-beautifulsoup4-airflow:0.0.1"
docker run -it my-beautifulsoup4-airflow:0.0.1 python -c 'import bs4; import sys; sys.exit(0)' && \
    echo "Success! Beautifulsoup4 installed" && echo

  • you can place pip.conf (and legacy .piprc) in the docker-context-files folder and they will be used for all pip commands (for example you can configure your own sources or authentication mechanisms)

docs/docker-stack/docker-examples/customizing/custom-pip.sh

mkdir -p docker-context-files

cat <<EOF >./docker-context-files/pip.conf
[global]
verbose = 2
EOF

export DOCKER_BUILDKIT=1
docker build . \
    --build-arg DOCKER_CONTEXT_FILES=./docker-context-files \
    --tag "my-custom-pip-verbose-airflow:0.0.1"
docker run -it my-custom-pip-verbose-airflow:0.0.1 python -c 'import sys; sys.exit(0)' && \
    echo "Success! Image with custom pip.conf built" && echo

  • you can place .whl packages that you downloaded and install them with INSTALL_PACKAGES_FROM_CONTEXT set to true. This is useful if you build the image in restricted security environments (see: Build images in security restricted environments for details):

docs/docker-stack/docker-examples/restricted/restricted_environments.sh

mkdir -p docker-context-files
export AIRFLOW_VERSION="2.2.4"
rm docker-context-files/*.whl docker-context-files/*.tar.gz docker-context-files/*.txt || true

curl -Lo "docker-context-files/constraints-3.7.txt" \
    "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-3.7.txt"

echo
echo "Make sure you use the right python version here (should be same as in constraints)!"
echo
python --version

pip download --dest docker-context-files \
    --constraint docker-context-files/constraints-3.7.txt  \
    "apache-airflow[async,celery,elasticsearch,kubernetes,postgres,redis,ssh,statsd,virtualenv]==${AIRFLOW_VERSION}"

Note

You can also pass --build-arg DOCKER_CONTEXT_FILES=. if you want to place your requirements.txt in the main directory without creating a dedicated folder. However, it is good practice to keep any files that you copy to the image context in a sub-folder. This makes it easier to separate things that are used on the host from those that are passed in the Docker context. Of course, by default when you run docker build . the whole folder is available as the “Docker build context” and sent to the Docker engine, but the DOCKER_CONTEXT_FILES are always copied to the build segment of the image, so copying your whole local folder might unnecessarily increase the time needed to build the image, and your cache will be invalidated every time any of the files in your local folder changes.

Warning

BREAKING CHANGE! As of Airflow 2.3.0 you need to specify an additional flag, --build-arg DOCKER_CONTEXT_FILES=docker-context-files, in order to use the files placed in docker-context-files. Previously that switch was not needed. Unfortunately this change is needed in order to make the Dockerfile a standalone file that works without any extra files. As of Airflow 2.3.0 the Dockerfile that is released with Airflow does not need any extra folders or files and can be copied and used from any folder. Previously you needed to copy Airflow sources together with the Dockerfile, as some scripts were needed to make it work. With Airflow 2.3.0, we are using Buildkit features that enable us to make the Dockerfile a completely standalone file that can be used “as-is”.

Examples of image customizing

Building from PyPI packages

This is the basic way of building the custom images from sources.

The following example builds the production image with Python 3.7 and the latest PyPI-released Airflow, with the default set of Airflow extras and dependencies. The latest PyPI-released Airflow constraints are used automatically.

docs/docker-stack/docker-examples/customizing/stable-airflow.sh

export DOCKER_BUILDKIT=1

docker build . \
    --tag "my-stable-airflow:0.0.1"

The following example builds the production image with Python 3.7 and default extras from the 2.3.2 Airflow package. The 2.3.2 constraints are used automatically.

docs/docker-stack/docker-examples/customizing/pypi-selected-version.sh

export AIRFLOW_VERSION=2.3.2
export DOCKER_BUILDKIT=1

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-bullseye" \
    --build-arg AIRFLOW_VERSION="${AIRFLOW_VERSION}" \
    --tag "my-pypi-selected-version:0.0.1"

The following example builds the production image with Python 3.8, additional Airflow extras (mssql,hdfs) from the 2.3.2 PyPI package, and an additional dependency (oauth2client).

docs/docker-stack/docker-examples/customizing/pypi-extras-and-deps.sh

export AIRFLOW_VERSION=2.3.2
export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.8-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_VERSION="${AIRFLOW_VERSION}" \
    --build-arg ADDITIONAL_AIRFLOW_EXTRAS="mssql,hdfs" \
    --build-arg ADDITIONAL_PYTHON_DEPS="oauth2client" \
    --tag "my-pypi-extras-and-deps:0.0.1"

The following example adds mpi4py package which requires both build-essential and mpi compiler.

docs/docker-stack/docker-examples/customizing/add-build-essential-custom.sh

export AIRFLOW_VERSION=2.2.4
export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_VERSION="${AIRFLOW_VERSION}" \
    --build-arg ADDITIONAL_PYTHON_DEPS="mpi4py" \
    --build-arg ADDITIONAL_DEV_APT_DEPS="libopenmpi-dev" \
    --build-arg ADDITIONAL_RUNTIME_APT_DEPS="openmpi-common" \
    --tag "my-build-essential-image:0.0.1"

The above image is equivalent to the “extended” image from the previous chapter, but its size is only 874 MB. Compared to the 1.1 GB of the “extended” image, this is about 230 MB less, so you can achieve a ~20% improvement in image size by using “customization” instead of extension. The saving can increase if you have more complex dependencies to build.
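
The arithmetic behind that estimate can be checked quickly. The sketch below assumes 1.1 GB is taken as 1100 MB, which gives a slightly smaller difference than the rounded “about 230 MB” quoted above.

```shell
# Image sizes in MB; 1.1 GB is treated as 1100 MB for this estimate.
extended_mb=1100
customized_mb=874

saved_mb=$((extended_mb - customized_mb))
saved_pct=$((saved_mb * 100 / extended_mb))   # integer division

echo "Saved ${saved_mb} MB (~${saved_pct}%)"   # prints: Saved 226 MB (~20%)
```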

Building optimized images

The following example builds the production image for Python 3.7 with additional Airflow extras from the 2.2.4 PyPI package, and it also includes additional apt dev and runtime dependencies.

The dev dependencies are those that require build-essential and usually involve recompiling some Python dependencies, so those packages might require additional dev dependencies to be present during recompilation. They are not needed at runtime, so we only install them at “build” time. They are not installed in the final image, which produces much smaller images. In this case pandas requires recompilation, so it also needs gcc and g++ as dev apt dependencies. The jre-headless package does not require recompiling, so it can be installed as a runtime apt dependency.

docs/docker-stack/docker-examples/customizing/pypi-dev-runtime-deps.sh

export AIRFLOW_VERSION=2.2.4
export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_VERSION="${AIRFLOW_VERSION}" \
    --build-arg ADDITIONAL_AIRFLOW_EXTRAS="jdbc" \
    --build-arg ADDITIONAL_PYTHON_DEPS="pandas" \
    --build-arg ADDITIONAL_DEV_APT_DEPS="gcc g++" \
    --build-arg ADDITIONAL_RUNTIME_APT_DEPS="default-jre-headless" \
    --tag "my-pypi-dev-runtime:0.0.1"

Building from GitHub

This method is usually used for development purposes. However, if you have your own fork, you can point the build to your forked version of the source code without having to release it to PyPI. It is enough to have a branch or tag in your repository and to use that tag or branch in the URL that you point the installation to.

In case of GitHub builds you need to pass the constraints reference manually in case you want to use specific constraints, otherwise the default constraints-main is used.

The following example builds the production image for Python 3.7 with default extras from the latest main version; constraints are taken from the latest version of the constraints-main branch in GitHub.

docs/docker-stack/docker-examples/customizing/github-main.sh

export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="https://github.com/apache/airflow/archive/main.tar.gz#egg=apache-airflow" \
    --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-main" \
    --tag "my-github-main:0.0.1"

The following example builds the production image with default extras from the latest v2-*-test version and constraints are taken from the latest version of the constraints-2-* branch in GitHub (for example v2-2-test branch matches constraints-2-2). Note that this command might fail occasionally as only the “released version” constraints when building a version and “main” constraints when building main are guaranteed to work.

docs/docker-stack/docker-examples/customizing/github-v2-2-test.sh

export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.8-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="https://github.com/apache/airflow/archive/v2-2-test.tar.gz#egg=apache-airflow" \
    --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-2-2" \
    --tag "my-github-v2-2:0.0.1"
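
The branch-to-constraints naming described above (main uses constraints-main, v2-2-test uses constraints-2-2) can be sketched as a small helper. The function name constraints_reference_for is hypothetical, and only the two branch patterns mentioned in the text are handled.

```shell
# Maps an Airflow branch name to the matching constraints branch.
# Only "main" and "vX-Y-test" are handled; anything else is rejected.
constraints_reference_for() {
    case "$1" in
        main)
            echo "constraints-main"
            ;;
        v*-test)
            version="${1#v}"                      # e.g. "2-2-test"
            echo "constraints-${version%-test}"   # e.g. "constraints-2-2"
            ;;
        *)
            echo "unsupported branch: $1" >&2
            return 1
            ;;
    esac
}

constraints_reference_for "v2-2-test"   # prints constraints-2-2
```

The result can then be passed as the AIRFLOW_CONSTRAINTS_REFERENCE build arg.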

You can also specify another repository to build from. If you also want to use different constraints repository source, you must specify it as additional CONSTRAINTS_GITHUB_REPOSITORY build arg.

The following example builds the production image using potiuk/airflow fork of Airflow and constraints are also downloaded from that repository.

docs/docker-stack/docker-examples/customizing/github-different-repository.sh

export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.8-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="https://github.com/potiuk/airflow/archive/main.tar.gz#egg=apache-airflow" \
    --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-main" \
    --build-arg CONSTRAINTS_GITHUB_REPOSITORY="potiuk/airflow" \
    --tag "github-different-repository-image:0.0.1"
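
The installation URLs in the GitHub examples above all follow the same shape, so for a fork they can be assembled from the repository and the branch or tag. This is a sketch only; the helper name airflow_archive_url is hypothetical.

```shell
# Builds the archive URL for a GitHub repository ("owner/name") and a branch or tag,
# following the URL shape used in the examples above.
airflow_archive_url() {
    repository="$1"
    reference="$2"
    echo "https://github.com/${repository}/archive/${reference}.tar.gz#egg=apache-airflow"
}

airflow_archive_url "potiuk/airflow" "main"
```

The printed value is what the example passes as AIRFLOW_INSTALLATION_METHOD.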

Using custom installation sources

You can customize more aspects of the image, such as additional commands executed before apt dependencies are installed, or extra sources to install your dependencies from. All the arguments are described below; a rather complex example of customizing the image, based on an example posted in a comment, is shown at the end of this section.

In case you need to use your custom PyPI package indexes, you can also customize the PyPI sources used during the image build by adding a docker-context-files/pip.conf file when building the image. This pip.conf will not be committed to the repository (it is added to .gitignore) and it will not be present in the final production image. It is added and used only in the build segment of the image. Therefore this pip.conf file can safely contain the list of package indexes you want to use, as well as the usernames and passwords used for authentication. More details about the pip.conf file can be found in the pip configuration documentation.
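
A minimal sketch of such a file is shown below. The index URL https://pypi.example.com/simple is a placeholder, not a real index, and the file is written to a scratch directory so that the sketch does not overwrite an existing pip.conf.

```shell
# Scratch directory standing in for the folder that holds the Dockerfile.
project_dir="$(mktemp -d)"
mkdir -p "${project_dir}/docker-context-files"

# Hypothetical private index; replace the URL with your own.
cat > "${project_dir}/docker-context-files/pip.conf" <<'EOF'
[global]
index-url = https://pypi.example.com/simple
EOF

cat "${project_dir}/docker-context-files/pip.conf"
```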

If you used the .piprc before (some older versions of pip used it for customization), you can put it in the docker-context-files/.piprc file and it will be automatically copied to HOME directory of the airflow user.

Note that those customizations are only available in the build segment of the Airflow image and are not present in the final image. If you wish to extend the final image and add a custom .piprc and pip.conf, you should add them in your own Dockerfile used to extend the Airflow image.

Such customizations are independent of the way how airflow is installed.

Note

Similar results could be achieved by modifying the Dockerfile manually (see below) and injecting the commands needed, but by specifying the customizations via build args you avoid the need to synchronize the changes from future Airflow Dockerfiles. Those customizations should work with future versions of Airflow’s official Dockerfile with, at most, minimal modifications of parameter names (if any), so using the build command for your customizations makes your custom image more future-proof.

The following - rather complex - example shows capabilities of:

  • Adding airflow extras (slack, odbc)

  • Adding PyPI dependencies (azure-storage-blob, oauth2client, beautifulsoup4, dateparser, rocketchat_API, typeform)

  • Adding custom environment variables while installing apt dependencies - both DEV and RUNTIME (ACCEPT_EULA=Y)

  • Adding custom curl command for adding keys and configuring additional apt sources needed to install apt dependencies (both DEV and RUNTIME)

  • Adding custom apt dependencies, both DEV (msodbcsql17 unixodbc-dev g++) and RUNTIME (msodbcsql17 unixodbc git procps vim)

docs/docker-stack/docker-examples/customizing/custom-sources.sh

export AIRFLOW_VERSION=2.2.4
export DEBIAN_VERSION="buster"
export DOCKER_BUILDKIT=1

docker build . -f Dockerfile \
    --pull \
    --platform 'linux/amd64' \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_VERSION="${AIRFLOW_VERSION}" \
    --build-arg ADDITIONAL_AIRFLOW_EXTRAS="slack,odbc" \
    --build-arg ADDITIONAL_PYTHON_DEPS=" \
        azure-storage-blob<12.9.0 \
        oauth2client \
        beautifulsoup4 \
        dateparser \
        rocketchat_API \
        typeform" \
    --build-arg ADDITIONAL_DEV_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | \
    apt-key add --no-tty - && \
    curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
    --build-arg ADDITIONAL_DEV_APT_ENV="ACCEPT_EULA=Y" \
    --build-arg ADDITIONAL_DEV_APT_DEPS="msodbcsql17 unixodbc-dev g++" \
    --build-arg ADDITIONAL_RUNTIME_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | \
    apt-key add --no-tty - && \
    curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
    --build-arg ADDITIONAL_RUNTIME_APT_ENV="ACCEPT_EULA=Y" \
    --build-arg ADDITIONAL_RUNTIME_APT_DEPS="msodbcsql17 unixodbc git procps vim" \
    --tag "my-custom-sources-image:0.0.1"

Build images in security restricted environments

You can also make sure your image is built using only a local constraint file and locally downloaded wheel files. This is often useful in enterprise environments where the binary files are verified and vetted by security teams. It is also the most complex way of building the image: you should be an expert in building and using Dockerfiles, and have specific security needs, if you want to follow this route.

The build below produces the production image with packages and constraints taken from the local docker-context-files folder rather than installed from PyPI or GitHub. It also disables the MySQL client installation, because that installation relies on an external installation method.

Note that, as a prerequisite, you need to have the wheel files downloaded. In the example below we first download the constraint file locally and then use pip download to get the .whl files needed, but in the most likely scenario those wheel files would be copied from an internal repository of .whl files. Note that AIRFLOW_VERSION_SPECIFICATION is only there for reference; the Apache Airflow .whl file in the right version is part of the .whl files downloaded.

Note that pip download only works on a Linux host, as some of the packages need to be compiled from sources and you cannot install them by providing the --platform switch. The packages also need to be downloaded using the same Python version as the target image.

The pip download might happen in a separate environment. The files can be committed to a separate binary repository, vetted and verified by the security team, and used subsequently to build images of Airflow when needed on an air-gapped system.

Example of preparing the constraint files and wheel files. Note that the mysql dependency is removed, as mysqlclient is installed from Oracle’s apt repository; if you want to add it, you need to provide this library from your own repository when building the Airflow image in an “air-gapped” system.

docs/docker-stack/docker-examples/restricted/restricted_environments.sh

mkdir -p docker-context-files
export AIRFLOW_VERSION="2.2.4"
rm docker-context-files/*.whl docker-context-files/*.tar.gz docker-context-files/*.txt || true

curl -Lo "docker-context-files/constraints-3.7.txt" \
    "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-3.7.txt"

echo
echo "Make sure you use the right python version here (should be same as in constraints)!"
echo
python --version

pip download --dest docker-context-files \
    --constraint docker-context-files/constraints-3.7.txt  \
    "apache-airflow[async,celery,elasticsearch,kubernetes,postgres,redis,ssh,statsd,virtualenv]==${AIRFLOW_VERSION}"

After this step is finished, your docker-context-files folder will contain all the packages that are needed to install Airflow from.
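
Because the packages must be downloaded with the same Python version as the constraints file targets, it can help to check this before running pip download. The helper below is a sketch with a hypothetical name, python_matches_constraints; it compares a version string in the format printed by python --version with a constraints file name such as constraints-3.7.txt.

```shell
# Succeeds when a "python --version" string (e.g. "Python 3.7.13") has the same
# major.minor version as a constraints file name (e.g. "constraints-3.7.txt").
python_matches_constraints() {
    version_string="$1"
    constraints_file="$2"
    full_version="${version_string#Python }"        # "3.7.13"
    major_minor="${full_version%.*}"                # "3.7"
    expected="${constraints_file##*constraints-}"   # "3.7.txt"
    expected="${expected%.txt}"                     # "3.7"
    [ "${major_minor}" = "${expected}" ]
}

if python_matches_constraints "Python 3.7.13" "docker-context-files/constraints-3.7.txt"; then
    echo "Python version matches the constraints file"
fi
```

In a real script the first argument would typically come from the interpreter itself, for example "$(python --version 2>&1)".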

Those downloaded packages and constraint file can be pre-vetted by your security team before you attempt to install the image. You can also store those downloaded binary packages in your private artifact registry which allows for the flow where you will download the packages on one machine, submit only new packages for security vetting and only use the new packages when they were vetted.
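
The “submit only new packages” flow can be sketched by comparing the downloaded file names against a list of already vetted ones. Everything named here is an assumption for illustration: the vetted.txt list, its one-file-name-per-line format, and the stand-in wheel names.

```shell
workdir="$(mktemp -d)"
mkdir -p "${workdir}/docker-context-files"

# Stand-ins for downloaded packages (file names are illustrative).
touch "${workdir}/docker-context-files/alpha-1.0-py3-none-any.whl" \
      "${workdir}/docker-context-files/beta-2.0-py3-none-any.whl"

# Hypothetical, sorted list of file names the security team has already vetted.
printf '%s\n' "alpha-1.0-py3-none-any.whl" > "${workdir}/vetted.txt"

# File names present in the folder but missing from the vetted list.
new_packages="$(
    cd "${workdir}/docker-context-files" &&
    printf '%s\n' *.whl | LC_ALL=C sort | LC_ALL=C comm -23 - "${workdir}/vetted.txt"
)"
echo "${new_packages}"   # prints beta-2.0-py3-none-any.whl
```

Comparing file names only shows which packages are new; it does not replace verifying the contents of the files themselves.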

On a separate (air-gapped) system, all the PyPI packages can be copied to docker-context-files, where you can build the image from the downloaded packages by passing these build args:

  • INSTALL_PACKAGES_FROM_CONTEXT="true" - to use packages present in docker-context-files

  • AIRFLOW_PRE_CACHED_PIP_PACKAGES="false" - to not pre-cache packages from PyPI when building image

  • AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/YOUR_CONSTRAINT_FILE.txt - to point to the downloaded constraint file

  • (Optional) INSTALL_MYSQL_CLIENT="false" if you do not want to install MySQL client from the Oracle repositories.

  • (Optional) INSTALL_MSSQL_CLIENT="false" if you do not want to install MsSQL client from the Microsoft repositories.

  • (Optional) INSTALL_POSTGRES_CLIENT="false" if you do not want to install Postgres client from the Postgres repositories.

Note that the solution we have for installing Python packages from local packages only solves the problem of “air-gapped” Python installation. The Docker image also downloads apt dependencies and node-modules. Those types of dependencies are, however, more likely to be available in your “air-gapped” system via transparent proxies, and the build should automatically reach out to your private registries. In the future the solution might be applied to both of those installation steps.

You can also use the techniques described in the previous chapter to make docker build use your private, security-vetted apt sources or private PyPI repositories (via .pypirc).

If you fulfill all the criteria, you can build the image on an air-gapped system by running a command similar to the one below:

docs/docker-stack/docker-examples/restricted/restricted_environments.sh

export DEBIAN_VERSION="bullseye"
export DOCKER_BUILDKIT=1

docker build . \
    --pull \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-${DEBIAN_VERSION}" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="apache-airflow" \
    --build-arg AIRFLOW_VERSION="${AIRFLOW_VERSION}" \
    --build-arg INSTALL_MYSQL_CLIENT="false" \
    --build-arg INSTALL_MSSQL_CLIENT="false" \
    --build-arg INSTALL_POSTGRES_CLIENT="true" \
    --build-arg AIRFLOW_PRE_CACHED_PIP_PACKAGES="false" \
    --build-arg DOCKER_CONTEXT_FILES="docker-context-files" \
    --build-arg INSTALL_PACKAGES_FROM_CONTEXT="true" \
    --build-arg AIRFLOW_CONSTRAINTS_LOCATION="/docker-context-files/constraints-3.7.txt" \
    --tag airflow-my-restricted-environment:0.0.1
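
Before starting such a build, a quick pre-flight check can confirm that the context folder holds what the build args refer to. This is a sketch only; context_is_ready is a hypothetical helper that checks for the constraint file and at least one wheel file, nothing more.

```shell
# Succeeds when the folder holds the constraint file and at least one .whl file.
context_is_ready() {
    context_dir="$1"
    constraints_file="$2"
    [ -f "${context_dir}/${constraints_file}" ] || return 1
    for wheel in "${context_dir}"/*.whl; do
        # If the glob matched nothing, the literal pattern fails this test.
        [ -e "${wheel}" ] && return 0
    done
    return 1
}

if context_is_ready "docker-context-files" "constraints-3.7.txt"; then
    echo "docker-context-files looks complete"
else
    echo "docker-context-files is missing the constraint file or wheel files" >&2
fi
```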

Modifying the Dockerfile

The build-arg approach is a convenience method if you do not want to modify the Dockerfile manually. It is flexible enough to accommodate most requirements and customizations out of the box, and when you use it you do not need to worry about adapting the image every time a new version of Airflow is released. Sometimes, however, it is not enough: if you have very specific needs and want to build a very custom image, you can modify the Dockerfile manually as you see fit and store it in your forked repository. You will then have to rebase your changes whenever a new version of Airflow is released, because we might modify the approach of our Dockerfile builds in the future and you might need to resolve conflicts.

There are a few things to remember when you modify the Dockerfile:

  • We are using the widely recommended .dockerignore pattern where everything is ignored by default and only the required folders are re-included through exception rules (!). This keeps the Docker context small: many binary artifacts are generated in the Airflow sources, and if they were added to the context, the time needed to build the image would increase significantly. If you want any new folders to be available in the image, you must add them here with a leading !

    # Ignore everything
    **
    
    # Allow only these directories
    !airflow
    ...
    
  • The docker-context-files folder is automatically added to the context of the image, so if you want to add individual files, binaries, requirement files, etc., you can add them there. The docker-context-files folder is copied to the /docker-context-files folder of the build segment of the image, so it is not present in the final image, which makes the final image smaller in case you want to use those files only in the build segment. If you want the files in your final image (in the main image segment), you must copy them from that directory manually using the COPY command.
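
Re-including a new folder in the .dockerignore pattern described above can be sketched as follows. The folder name my_company_plugins is hypothetical, and the file is written to a scratch directory rather than to a real Airflow checkout.

```shell
project_dir="$(mktemp -d)"

# Starting point mirroring the "ignore everything, then allow" pattern.
cat > "${project_dir}/.dockerignore" <<'EOF'
# Ignore everything
**

# Allow only these directories
!airflow
EOF

# Re-include a hypothetical folder by appending it with a leading "!".
printf '!%s\n' "my_company_plugins" >> "${project_dir}/.dockerignore"

cat "${project_dir}/.dockerignore"
```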

More details

Build Args reference

The detailed --build-arg reference can be found in Image build arguments reference.

The architecture of the images

You can read more details about the images - the context, their parameters and internal structure in the IMAGES.rst document.
