Building the image

Before you dive deeply into how the Airflow image is built, let us first explain why you might need to build a custom container image, and show a few typical ways you can do it.

Why a custom image?

The Apache Airflow community releases Docker images which are reference images for Apache Airflow. However, Airflow has more than 60 community-managed providers (installable via extras), and some of the extras/providers installed by default are not used by everyone. Sometimes other extras/providers are needed, and sometimes (very often, actually) you need to add your own custom dependencies, packages, or even custom providers.

In Kubernetes and Docker terms this means that you need another image with your specific requirements. This is why you should learn how to build your own Docker (or, more properly, container) image. You might be tempted to use the reference image and dynamically install the new packages while starting your containers, but this is a bad idea for multiple reasons - starting with the fragility of such builds and ending with the extra time needed to install those packages, which has to happen every time every container starts. The only viable way to deal with new dependencies and requirements in production is to build and use your own image. You should only install dependencies dynamically in "hobbyist" and "quick start" scenarios, when you want to iterate quickly to try things out, and later replace that with your own images.

How to build your own image

There are several typical scenarios that you will encounter, and here is a quick recipe for how to achieve your goal quickly. To understand the details you can read further, but for the simple cases using typical tools, here are the simple examples.

In the simplest case, building your image consists of these steps:

  1. Create your own Dockerfile (name it Dockerfile) where you add:

  • information about what your image should be based on (for example FROM apache/airflow:|airflow-version|-python3.8)

  • additional steps that should be executed in your image (typically in the form of RUN <command>)

  2. Build your image. This can be done with the docker CLI tools, and the examples below assume docker is used. There are other tools like kaniko or podman that allow you to build the image, but docker is so far the most popular and developer-friendly tool out there. A typical way of building the image looks as follows (my-image:0.0.1 is the custom tag of your image containing the version). In case you use some kind of registry where you will be using the image from, it is usually named in the form of registry/image-name. The name of the image has to be configured in the deployment method that will use your image. This can be set, for example, as the image name in the docker-compose file or in the Helm chart.

docker build . -f Dockerfile --tag my-image:0.0.1

  3. [Optional] Test the image. Airflow contains a tool that allows you to test the image. This step, however, requires locally checked out or extracted Airflow sources. If you happen to have the sources, you can test the image by running this command (in the Airflow root folder). The output will tell you whether the image is "good to go".

./scripts/ci/tools/verify_docker_image.sh PROD my-image:0.0.1

  4. Once you have built the image locally, you usually have several options to make it available for your deployment:

  • For docker-compose deployment, that's all you need. The image is stored in the Docker engine cache and Docker Compose will use it from there.

  • For some development-targeted Kubernetes deployments you can load the images directly into the Kubernetes cluster. Clusters such as kind or minikube have a dedicated load command to load images into the cluster.

  • Last but not least - you can push your image to a remote registry, which is the most common way of storing and exposing images, and the most portable way of publishing them. Both Docker Compose and Kubernetes can make use of images exposed via registries (see the sketch after this list).
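
For example, assuming you built my-image:0.0.1 as above, a minimal sketch of loading it into a local cluster or pushing it to a registry could look like this (registry.example.com is a placeholder for your own registry):

# Load the image into a local kind cluster (default cluster name)
kind load docker-image my-image:0.0.1

# Or load it into a local minikube cluster
minikube image load my-image:0.0.1

# Or tag and push it to a remote registry so that any deployment can pull it
docker tag my-image:0.0.1 registry.example.com/my-image:0.0.1
docker push registry.example.com/my-image:0.0.1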

The most common scenarios where you want to build your own image are adding a new apt package, adding a new PyPI dependency and embedding DAGs into the image. Example Dockerfiles for those scenarios are below, and you can read further for more complex cases which might involve either extending or customizing the image.

Adding new apt package

The following example adds vim to the airflow image.

docs/docker-stack/docker-examples/extending/add-apt-packages/Dockerfile

FROM apache/airflow:2.1.4
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         vim \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow

Adding a new PyPI package

The following example adds the lxml Python package from PyPI to the image.

docs/docker-stack/docker-examples/extending/add-pypi-packages/Dockerfile

FROM apache/airflow:2.1.4
RUN pip install --no-cache-dir lxml

Embedding DAGs

The following example adds test_dag.py to your image in the /opt/airflow/dags folder.

docs/docker-stack/docker-examples/extending/embedding-dags/Dockerfile

FROM apache/airflow:2.1.4

COPY --chown=airflow:root test_dag.py /opt/airflow/dags

docs/docker-stack/docker-examples/extending/embedding-dags/test_dag.py

"""This dag only runs some simple tasks to test Airflow's task execution."""
from datetime import datetime, timedelta

from airflow.models.dag import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

now = datetime.now()
now_to_the_hour = (now - timedelta(0, 0, 0, 0, 0, 3)).replace(minute=0, second=0, microsecond=0)
START_DATE = now_to_the_hour
DAG_NAME = 'test_dag_v1'

default_args = {'owner': 'airflow', 'depends_on_past': True, 'start_date': days_ago(2)}
dag = DAG(DAG_NAME, schedule_interval='*/10 * * * *', default_args=default_args)

run_this_1 = DummyOperator(task_id='run_this_1', dag=dag)
run_this_2 = DummyOperator(task_id='run_this_2', dag=dag)
run_this_2.set_upstream(run_this_1)
run_this_3 = DummyOperator(task_id='run_this_3', dag=dag)
run_this_3.set_upstream(run_this_2)

Extending vs. customizing the image

You might want to know very quickly how you can extend or customize the existing image for Apache Airflow. This chapter gives you a short answer to those questions.

Here is a comparison of the two types of image building, to guide you in choosing how you want to build your image.

                                                      Extending   Customizing
Can be built without Airflow sources                  Yes         No
Uses the familiar 'FROM' pattern of image building    Yes         No
Requires only basic knowledge about images            Yes         No
Builds quickly                                        Yes         No
Produces an image heavily optimized for size          No          Yes
Can build from custom Airflow sources (forks)         No          Yes
Can be built on an air-gapped system                  No          Yes

TL;DR: If you need to build a custom image, it is easier to start with "Extending". However, if your dependencies require a compilation step or you need to build the image from security-vetted packages, switching to "Customizing" produces much more optimized images. In the examples further below, where we compare equivalent "Extending" and "Customizing" builds, the resulting images were 1.1 GB vs 874 MB respectively - roughly a 20% size improvement for the customized image.

Note

You can also combine both - customizing and extending the image in one. Your optimized base image can be built first using the customization method (for example by your admin team), with all the heavy, compilation-requiring dependencies, and published in your registry; others can then extend it using FROM and add their own lightweight dependencies. This reflects well the typical split where "casual" users will extend the image and "power users" will customize it.

Airflow Summit 2020's Production Docker Image talk provides more details about the context, architecture and customization/extension methods for the Production Image.

Extending the image

Extending the image is easiest if you just need to add some dependencies that do not require compiling. The compilation framework of Linux (the so-called build-essential) is pretty big, and for production images size is a really important factor to optimize for, so our production image does not contain build-essential. If you need a compiler like gcc or g++, or tools like make/cmake - they are not found in the image and it is recommended that you follow the "customizing" route instead.

How to extend the image - it is something you are most likely familiar with: simply build a new image using the Dockerfile's FROM directive and add whatever you need. You can then add your Debian dependencies with apt, your PyPI dependencies with pip install, or anything else you need.

You should be aware of a few things:

  • The production image of Airflow uses the "airflow" user, so if you want to add some tools as the root user, you need to switch to it with the USER directive of the Dockerfile and switch back to the airflow user when you are done. You should also remember to follow the best practices of Dockerfiles to make sure your image is lean and small.

  • The PyPI dependencies in Apache Airflow are installed in the user library of the "airflow" user, so pip packages are installed into the ~/.local folder, as if the --user flag was specified when running pip. Note also that using --no-cache-dir is a good idea that can help to make your image smaller.

Note

Only as of the 2.0.1 image is the --user flag turned on by default, by setting the PIP_USER environment variable to true. This can be disabled by unsetting the variable or by setting it to false. In the 2.0.0 image you had to add the --user flag yourself, as in pip install --user.

  • If your apt or PyPI dependencies require build-essential or other packages needed to compile your Python dependencies, then your best choice is to follow the "Customizing the image" route, because you can build a highly size-optimized image this way. However, it requires checking out the sources of Apache Airflow, so you might still choose to add build-essential to your image instead, even though your image will be significantly bigger.

  • You can also embed your DAGs in the image by simply adding them with the COPY directive of the Dockerfile. The DAGs in the production image are in the /opt/airflow/dags folder.

  • You can build your image without any need for the Airflow sources. It is enough to place the Dockerfile and any files that it refers to (such as DAG files) in a separate directory and run a command like docker build . --tag my-image:my-tag (where my-image is the name you want to give the image and my-tag is the tag you want to tag it with).

  • If your way of extending the image requires creating writable directories, you MUST remember to add a umask 0002 step in your RUN command. This is necessary to accommodate our approach of running the image with an arbitrary user. Such a user will always run with GID=0 - the entrypoint will prevent non-root GIDs. You can read more about it in the arbitrary docker user documentation for the entrypoint. The umask 0002 is set by default when you enter the image, so any directories you create at runtime will, by default, have GID=0 and will be group-writable.

Note

When you build an image for an Airflow version < 2.1 (for example 2.0.2 or 1.10.15), the image is built with pip 20.2.4, because pip 21+ is only supported for Airflow 2.1+.

Note

Only as of 2.0.2 is the default group of the airflow user root. Previously it was airflow, so if you are building your image based on an earlier image, you need to manually change the default group of the airflow user:

RUN usermod -g 0 airflow

Examples of image extending

Example of upgrading Airflow Provider packages

The Airflow providers are released independently of core Airflow, and sometimes you might want to upgrade specific providers only, to fix some problems or to use features available in a newer provider version. Here is an example of how you can do it:

docs/docker-stack/docker-examples/extending/add-providers/Dockerfile

FROM apache/airflow:2.1.4
RUN pip install --no-cache-dir apache-airflow-providers-docker==2.1.0

Example of adding apt package

The following example adds vim to the airflow image.

docs/docker-stack/docker-examples/extending/add-apt-packages/Dockerfile

FROM apache/airflow:2.1.4
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         vim \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow

Example of adding PyPI package

The following example adds the lxml Python package from PyPI to the image.

docs/docker-stack/docker-examples/extending/add-pypi-packages/Dockerfile

FROM apache/airflow:2.1.4
RUN pip install --no-cache-dir lxml

Example when writable directory is needed

The following example adds a new directory that is supposed to be writable for any arbitrary user running the container.

docs/docker-stack/docker-examples/extending/writable-directory/Dockerfile

FROM apache/airflow:2.1.4
RUN umask 0002; \
    mkdir -p ~/writeable-directory

Example when you add packages requiring compilation

The following example adds the mpi4py package, which requires both build-essential and an MPI compiler.

docs/docker-stack/docker-examples/extending/add-build-essential-extend/Dockerfile

FROM apache/airflow:2.1.4
USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         build-essential libopenmpi-dev \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
USER airflow
RUN pip install --no-cache-dir mpi4py

The size of this image is ~1.1 GB when built. As you will see further, you can achieve a 20% reduction in image size if you use "Customizing" rather than "Extending".

Example when you want to embed DAGs

The following example adds test_dag.py to your image in the /opt/airflow/dags folder.

docs/docker-stack/docker-examples/extending/embedding-dags/Dockerfile

FROM apache/airflow:2.1.4

COPY --chown=airflow:root test_dag.py /opt/airflow/dags

docs/docker-stack/docker-examples/extending/embedding-dags/test_dag.py

"""This dag only runs some simple tasks to test Airflow's task execution."""
from datetime import datetime, timedelta

from airflow.models.dag import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

now = datetime.now()
now_to_the_hour = (now - timedelta(0, 0, 0, 0, 0, 3)).replace(minute=0, second=0, microsecond=0)
START_DATE = now_to_the_hour
DAG_NAME = 'test_dag_v1'

default_args = {'owner': 'airflow', 'depends_on_past': True, 'start_date': days_ago(2)}
dag = DAG(DAG_NAME, schedule_interval='*/10 * * * *', default_args=default_args)

run_this_1 = DummyOperator(task_id='run_this_1', dag=dag)
run_this_2 = DummyOperator(task_id='run_this_2', dag=dag)
run_this_2.set_upstream(run_this_1)
run_this_3 = DummyOperator(task_id='run_this_3', dag=dag)
run_this_3.set_upstream(run_this_2)

Customizing the image

Customizing the image is an optimized way of adding your own dependencies to the image - better suited to preparing highly size-optimized production images, especially when you have dependencies that need to be compiled before installing (such as mpi4py).

It also allows more sophisticated usages, needed by "power users" - for example using a forked version of Airflow, or building the images from security-vetted sources.

The big advantage of this method is that it produces an optimized image even if you need compile-time dependencies that are not needed in the final image.

The disadvantage is that you need Airflow sources to build such images - either the official source distribution of Apache Airflow for released versions, or the checked-out sources (using release tags or the main branch) of the Airflow GitHub project, or your own fork if you happen to maintain one.

Another disadvantage is that the pattern of building Docker images with --build-arg is less familiar to developers of such images. However, it is quite well known to "power users". That's why the customizing flow is better suited for users who have more familiarity and more custom requirements.

The image also usually takes much longer to build than the equivalent "extended" image, because instead of extending the layers that already come from the base image, it rebuilds the layers needed to add the extra dependencies at early stages of the image build.

When customizing the image you can choose a number of options for how Airflow is installed:

  • From the PyPI releases (default)

  • From custom installation sources - using additional apt or PyPI repositories, or replacing the original ones

  • From local sources. This is used mostly during development.

  • From a tag, branch, or specific commit of the Airflow GitHub repository (or a fork). This is particularly useful when you build an image for a custom version of Airflow that you keep in your fork and do not want to release to PyPI.

  • From locally stored binary packages for Airflow, Airflow providers and other dependencies. This is particularly useful if you want to build Airflow in a highly secure environment where all such packages must be vetted by your security team and stored in your private artifact registry. This also allows building the Airflow image in an air-gapped environment.

  • Side note: building Airflow in an air-gapped environment sounds pretty funny, doesn't it?

You can also add a range of customizations while building the image:

  • the base Python image you use for Airflow

  • the version of Airflow to install

  • the extras to install for Airflow (or even removing some default extras)

  • additional apt/Python dependencies to use while building Airflow (DEV dependencies)

  • adding a requirements.txt file to the docker-context-files directory to add extra requirements

  • additional apt/Python dependencies to install for the runtime version of Airflow (RUNTIME dependencies)

  • additional commands and variables to set if needed during building or preparing the Airflow runtime

  • the constraint file to use when installing Airflow

Additional explanation is needed for the last point. Airflow uses constraints to make sure that it can be installed predictably, even when new versions of Airflow dependencies are released (or even dependencies of our dependencies!). The Docker image and accompanying scripts usually determine the right version of constraints automatically, based on the Airflow version installed and the Python version. For example, the 2.0.2 version of Airflow installed from PyPI uses the constraints from the constraints-2.0.2 tag. However, in some cases - for example when installing Airflow from GitHub - you have to specify the version of constraints manually, otherwise it will default to the latest version of the constraints, which might not be compatible with the version of Airflow you use.

You can also download any version of the Airflow constraints file, adapt it to your own set of constraints by manually setting your own versions of dependencies, and then use the constraints file you prepared.

You can read more about constraints in Installation from PyPI.
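
A hedged sketch of that flow (the my-constraints-3.7.txt file name and the image tag are examples only; it assumes you keep the edited file in docker-context-files so that the AIRFLOW_CONSTRAINTS_LOCATION build arg can point to it, as in the restricted-environment example further below):

# Download the released constraints for the Airflow/Python combination you target
mkdir -p docker-context-files
curl -Lo docker-context-files/my-constraints-3.7.txt \
    https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.7.txt

# Edit docker-context-files/my-constraints-3.7.txt to pin the dependency versions you need,
# then point the image build at the adapted file
docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --build-arg AIRFLOW_CONSTRAINTS_LOCATION="/docker-context-files/my-constraints-3.7.txt" \
    --tag "my-adapted-constraints:0.0.1"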

Note that if you place requirements.txt in the docker-context-files folder, it will be used to install all requirements declared there. It is recommended that the file pins the versions of dependencies with the == version specifier, to achieve a stable set of requirements that does not change when someone releases a newer version. However, you then have to make sure to update those requirements and rebuild the images regularly to account for the latest security fixes.
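
For example, a pinned requirements.txt placed in docker-context-files might look like the sketch below (the packages and versions are placeholders - use the ones you actually need):

# Create a requirements file with exact pins so rebuilds are reproducible
mkdir -p docker-context-files
cat > docker-context-files/requirements.txt <<'EOF'
# example packages/versions only - replace with your own pinned dependencies
lxml==4.6.3
beautifulsoup4==4.9.3
EOF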

Examples of image customizing

Building from PyPI packages

This is the basic way of building custom images.

The following example builds the production image with Python 3.6, the latest PyPI-released Airflow, and the default set of Airflow extras and dependencies. The 2.0.2 constraints are used automatically.

docs/docker-stack/docker-examples/customizing/stable-airflow.sh

docker build . \
    --tag "my-stable-airflow:0.0.1"

The following example builds the production image with Python 3.7 and the default extras from the 2.0.2 PyPI package. The 2.0.2 constraints are used automatically.

docs/docker-stack/docker-examples/customizing/pypi-selected-version.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --tag "my-pypi-selected-version:0.0.1"

The following example builds the production image with Python 3.8, additional Airflow extras (mssql, hdfs) from the 2.0.2 PyPI package, and an additional dependency (oauth2client).

docs/docker-stack/docker-examples/customizing/pypi-extras-and-deps.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.8-slim-buster" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --build-arg ADDITIONAL_AIRFLOW_EXTRAS="mssql,hdfs" \
    --build-arg ADDITIONAL_PYTHON_DEPS="oauth2client" \
    --tag "my-pypi-extras-and-deps:0.0.1"

The following example adds the mpi4py package, which requires both build-essential and an MPI compiler.

docs/docker-stack/docker-examples/customizing/add-build-essential-custom.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.6-slim-buster" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --build-arg ADDITIONAL_PYTHON_DEPS="mpi4py" \
    --build-arg ADDITIONAL_DEV_APT_DEPS="libopenmpi-dev" \
    --build-arg ADDITIONAL_RUNTIME_APT_DEPS="openmpi-common" \
    --tag "my-build-essential-image:0.0.1"

The above image is the equivalent of the "extended" image from the previous chapter, but its size is only 874 MB. Compared to the 1.1 GB of the "extended" image, this is about 230 MB less, so you can achieve a ~20% improvement in image size by using "customization" instead of extension. The savings can increase if you have more complex dependencies to build.

Building optimized images

The following example builds the production image with Python 3.6 and additional Airflow extras from the 2.0.2 PyPI package, but it also includes additional apt dev and runtime dependencies.

The dev dependencies are those that require build-essential and are usually needed to compile some of the Python dependencies, so those packages might require additional DEV apt dependencies to be present during compilation. Those packages are not needed at runtime, so we only install them for the "build" time. They are not installed in the final image, which produces much smaller images. In this case pandas requires compilation, so it also needs gcc and g++ as dev apt dependencies. The jre-headless package does not require compiling, so it can be installed as a runtime apt dependency.

docs/docker-stack/docker-examples/customizing/pypi-dev-runtime-deps.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.6-slim-buster" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --build-arg ADDITIONAL_AIRFLOW_EXTRAS="jdbc" \
    --build-arg ADDITIONAL_PYTHON_DEPS="pandas" \
    --build-arg ADDITIONAL_DEV_APT_DEPS="gcc g++" \
    --build-arg ADDITIONAL_RUNTIME_APT_DEPS="default-jre-headless" \
    --tag "my-pypi-dev-runtime:0.0.1"

Building from GitHub

This method is usually used for development purposes. But if you have your own fork, you can point the build to your forked version of the source code without having to release it to PyPI. It is enough to have a branch or tag in your repository and to use that tag or branch in the URL that you point the installation to.

In case of GitHub builds you need to pass the constraints reference manually if you want to use specific constraints, otherwise the default constraints-main is used.

The following example builds the production image with Python 3.7 and default extras from the latest main version; constraints are taken from the latest version of the constraints-main branch in GitHub.

docs/docker-stack/docker-examples/customizing/github-main.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="https://github.com/apache/airflow/archive/main.tar.gz#egg=apache-airflow" \
    --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-main" \
    --tag "my-github-main:0.0.1"

The following example builds the production image with default extras from the latest v2-*-test version; constraints are taken from the latest version of the corresponding constraints-2-* branch in GitHub (for example the v2-1-test branch matches constraints-2-1). Note that this command might occasionally fail, as only the "released version" constraints (when building a released version) and the "main" constraints (when building main) are guaranteed to work.

docs/docker-stack/docker-examples/customizing/github-v2-1-test.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.8-slim-buster" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="https://github.com/apache/airflow/archive/v2-1-test.tar.gz#egg=apache-airflow" \
    --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-2-1" \
    --tag "my-github-v2-1:0.0.1"

You can also specify another repository to build from. If you also want to use a different constraints repository source, you must specify it with the additional CONSTRAINTS_GITHUB_REPOSITORY build arg.

The following example builds the production image using potiuk/airflow fork of Airflow and constraints are also downloaded from that repository.

docs/docker-stack/docker-examples/customizing/github-different-repository.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.8-slim-buster" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="https://github.com/potiuk/airflow/archive/main.tar.gz#egg=apache-airflow" \
    --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-main" \
    --build-arg CONSTRAINTS_GITHUB_REPOSITORY="potiuk/airflow" \
    --tag "github-different-repository-image:0.0.1"

Using custom installation sources

You can customize more aspects of the image - such as additional commands executed before apt dependencies are installed, or extra sources to install your dependencies from. You can see all the arguments described below, but here is an example of a rather complex command to customize the image, based on an example in this comment:

In case you need to use custom PyPI package indexes, you can also customize the PyPI sources used during the image build by adding a docker-context-files/.pypirc file when building the image. This .pypirc will not be committed to the repository (it is added to .gitignore) and it will not be present in the final production image. It is added and used only in the build segment of the image. Therefore this .pypirc file can safely contain the list of package indexes you want to use, and the usernames and passwords used for authentication. More details about the .pypirc file can be found in the pypirc specification.
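
A minimal sketch of such a file, following the pypirc specification (the index name, URL and credentials below are placeholders for your own private index):

# Place the .pypirc file in docker-context-files so it is only available in the build segment
cat > docker-context-files/.pypirc <<'EOF'
[distutils]
index-servers =
    private-pypi

[private-pypi]
repository = https://pypi.example.com/
username = build-user
password = build-password
EOF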

Such customizations are independent of the way Airflow is installed.

Note

Similar results could be achieved by modifying the Dockerfile manually (see below) and injecting the commands needed, but by specifying the customizations via build args you avoid the need to synchronize your changes with future Airflow Dockerfiles. Those customizations should work with future versions of Airflow's official Dockerfile with at most minimal modifications of parameter names (if any), so using the build command for your customizations makes your custom image more future-proof.

The following - rather complex - example shows capabilities of:

  • Adding airflow extras (slack, odbc)

  • Adding PyPI dependencies (azure-storage-blob, oauth2client, beautifulsoup4, dateparser, rocketchat_API, typeform)

  • Adding custom environment variables while installing apt dependencies - both DEV and RUNTIME (ACCEPT_EULA=Y)

  • Adding custom curl command for adding keys and configuring additional apt sources needed to install apt dependencies (both DEV and RUNTIME)

  • Adding custom apt dependencies, both DEV (msodbcsql17 unixodbc-dev g++) and RUNTIME (msodbcsql17 unixodbc git procps vim)

docs/docker-stack/docker-examples/customizing/custom-sources.sh

docker build . -f Dockerfile \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --build-arg ADDITIONAL_AIRFLOW_EXTRAS="slack,odbc" \
    --build-arg ADDITIONAL_PYTHON_DEPS=" \
        azure-storage-blob \
        oauth2client \
        beautifulsoup4 \
        dateparser \
        rocketchat_API \
        typeform" \
    --build-arg ADDITIONAL_DEV_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | \
    apt-key add --no-tty - && \
    curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
    --build-arg ADDITIONAL_DEV_APT_ENV="ACCEPT_EULA=Y" \
    --build-arg ADDITIONAL_DEV_APT_DEPS="msodbcsql17 unixodbc-dev g++" \
    --build-arg ADDITIONAL_RUNTIME_APT_COMMAND="curl https://packages.microsoft.com/keys/microsoft.asc | \
    apt-key add --no-tty - && \
    curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list" \
    --build-arg ADDITIONAL_RUNTIME_APT_ENV="ACCEPT_EULA=Y" \
    --build-arg ADDITIONAL_RUNTIME_APT_DEPS="msodbcsql17 unixodbc git procps vim" \
    --tag "my-custom-sources-image:0.0.1"

Build images in security restricted environments

You can also make sure your image is built using only a local constraint file and locally downloaded wheel files. This is often useful in enterprise environments where the binary files are verified and vetted by security teams. It is also the most complex way of building the image. You should be an expert at building and using Dockerfiles in order to use it, and you should have specific security needs if you want to follow this route.

The build below produces the production image with packages and constraints taken from the local docker-context-files folder rather than installed from PyPI or GitHub. It also disables MySQL client installation, as it uses an external installation method.

Note that as a prerequisite you need to have downloaded the wheel files. In the example below we first download such a constraint file locally and then use pip download to get the needed .whl files, but in the most likely scenario those wheel files would be copied from an internal repository of such .whl files. Note that AIRFLOW_VERSION_SPECIFICATION is only there for reference; the apache-airflow .whl file in the right version is part of the downloaded .whl files.

Note that 'pip download' only works on a Linux host, as some of the packages need to be compiled from sources and you cannot install them by providing the --platform switch. They also need to be downloaded using the same Python version as the target image.

The pip download might happen in a separate environment. The files can be committed to a separate binary repository, vetted/verified by the security team, and used subsequently to build images of Airflow when needed on an air-gapped system.

Example of preparing the constraint file and wheel files. Note that the mysql dependency is removed, as mysqlclient is installed from Oracle's apt repository; if you want to add it, you need to provide this library from your own repository in order to build the Airflow image in an "air-gapped" system.

docs/docker-stack/docker-examples/restricted/restricted_environments.sh

rm docker-context-files/*.whl docker-context-files/*.tar.gz docker-context-files/*.txt || true

curl -Lo "docker-context-files/constraints-3.7.txt" \
    https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.7.txt

# For Airflow pre 2.1 you need to use PIP 20.2.4 to install/download Airflow packages.
pip install pip==20.2.4

pip download --dest docker-context-files \
    --constraint docker-context-files/constraints-3.7.txt  \
    "apache-airflow[async,aws,azure,celery,dask,elasticsearch,gcp,kubernetes,postgres,redis,slack,ssh,statsd,virtualenv]==2.0.2"

After this step is finished, your docker-context-files folder will contain all the packages that are needed to install Airflow from.

Those downloaded packages and the constraint file can be pre-vetted by your security team before you attempt to build the image. You can also store those downloaded binary packages in your private artifact registry, which allows for a flow where you download the packages on one machine, submit only the new packages for security vetting, and use them only once they have been vetted.

On a separate (air-gapped) system, all the PyPI packages can be copied to docker-context-files, where you can build the image using the downloaded packages by passing these build args:

  • INSTALL_FROM_DOCKER_CONTEXT_FILES="true" - to use packages present in docker-context-files

  • AIRFLOW_PRE_CACHED_PIP_PACKAGES="false" - to not pre-cache packages from PyPI when building the image

  • AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/YOUR_CONSTRAINT_FILE.txt - to use the downloaded constraint file

  • (Optional) INSTALL_MYSQL_CLIENT="false" if you do not want to install MySQL client from the Oracle repositories. In this case also make sure that your

Note that the solution we have for installing Python packages from local packages only solves the problem of an "air-gapped" Python installation. The Docker image also downloads apt dependencies and node modules. Those types of dependencies are, however, more likely to be available in your "air-gapped" system via transparent proxies, so they should automatically reach your private registries. In the future the solution might be applied to both of those installation steps.

You can also use the techniques described in the previous chapter to make docker build use your private apt sources or private PyPI repositories (via .pypirc), which can be security-vetted.

If you fulfill all the criteria, you can build the image on an air-gapped system by running a command similar to the one below:

docs/docker-stack/docker-examples/restricted/restricted_environments.sh

docker build . \
    --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
    --build-arg AIRFLOW_INSTALLATION_METHOD="apache-airflow" \
    --build-arg AIRFLOW_VERSION="2.0.2" \
    --build-arg INSTALL_MYSQL_CLIENT="false" \
    --build-arg AIRFLOW_PRE_CACHED_PIP_PACKAGES="false" \
    --build-arg INSTALL_FROM_DOCKER_CONTEXT_FILES="true" \
    --build-arg AIRFLOW_CONSTRAINTS_LOCATION="/docker-context-files/constraints-3.7.txt" \
    --tag my-restricted-environment:0.0.1

Modifying the Dockerfile

The build-arg approach is a convenience method if you do not want to modify the Dockerfile manually. Our approach is flexible enough to accommodate most requirements and customizations out of the box. When you use it, you do not need to worry about adapting the image every time a new version of Airflow is released. However, sometimes it is not enough if you have very specific needs and want to build a very custom image. In such a case you can simply modify the Dockerfile manually as you see fit and store it in your forked repository. You will have to remember to rebase your changes whenever a new version of Airflow is released, because we might modify the approach of our Dockerfile builds in the future and you might need to resolve conflicts and rebase your changes.

There are a few things to remember when you modify the Dockerfile:

  • We are using the widely recommended pattern of .dockerignore where everything is ignored by default and only the required folders are added back through exclusion (!). This keeps the Docker context small, because there are many binary artifacts generated in the sources of Airflow and, if they were added to the context, the time needed to build the image would increase significantly. If you want any new folders to be available in the image, you must add them here with a leading !:

    # Ignore everything
    **
    
    # Allow only these directories
    !airflow
    ...
    
  • The docker-context-files folder is automatically added to the context of the image, so if you want to add individual files, binaries, requirement files, etc., you can add them there. The docker-context-files folder is copied into the /docker-context-files folder of the build segment of the image, so it is not present in the final image - which keeps the final image smaller if you only need those files during the build. If you want the files to be present in your final image (in the main image segment), you must copy them from the directory manually, using a COPY command.

More details

Build Args reference

The detailed --build-arg reference can be found in Image build arguments reference.

The architecture of the images

You can read more details about the images - the context, their parameters and internal structure in the IMAGES.rst document.
