Object Storage¶
New in version 2.8.0.
This is an experimental feature.
All major cloud providers offer persistent data storage in object stores. These are not classic “POSIX” file systems. In order to store hundreds of petabytes of data without any single points of failure, object stores replace the classic file system directory tree with a simpler model of object-name => data. To enable remote access, operations on objects are usually offered as (slow) HTTP REST operations.
Airflow provides a generic abstraction on top of object stores, like s3, gcs, and azure blob storage.
This abstraction allows you to use a variety of object storage systems in your DAGs without having to
change your code to deal with every different object storage system. In addition, it allows you to use
most of the standard Python modules, like shutil, that can work with file-like objects.
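As a minimal sketch of what working with file-like objects looks like, the helper below streams one file into another with shutil.copyfileobj. It only assumes that both arguments expose a pathlib-style open() method. The name copy_stream is hypothetical, and pathlib.Path stands in for ObjectStoragePath so that the snippet runs without any provider installed.

```python
import shutil
import tempfile
from pathlib import Path


def copy_stream(src, dst, chunk_size: int = 1024 * 1024) -> None:
    """Copy src to dst in chunks, using only the file-like interface."""
    with src.open("rb") as fin, dst.open("wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)


# pathlib.Path is a local stand-in for ObjectStoragePath (assumption).
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.bin"
target = workdir / "target.bin"
source.write_bytes(b"hello object storage")
copy_stream(source, target)
copied = target.read_bytes()
```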
Support for a particular object storage system depends on the providers you have installed. For
example, if you have installed the apache-airflow-providers-google provider, you will be able to
use the gcs scheme for object storage. Out of the box, Airflow provides support for the file scheme.
Note
Support for s3 requires you to install apache-airflow-providers-amazon[s3fs]. This is because
it depends on aiobotocore, which is not installed by default as it can create dependency
challenges with botocore.
Cloud Object Stores are not real file systems¶
Object stores are not real file systems although they can appear so. They do not support all the operations that a real file system does. Key differences are:
No guaranteed atomic rename operation. This means that if you move a file from one location to another, it will be copied and then deleted. If the copy fails, you will lose the file.
Directories are emulated and might make working with them slow. For example, listing a directory might require listing all the objects in the bucket and filtering them by prefix.
Seeking within a file may require significant call overhead hurting performance or might not be supported at all.
Airflow relies on fsspec to provide a consistent experience across different object storage systems. It implements local file caching to speed up access. However, you should be aware of the limitations of object storage when designing your DAGs.
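The first two differences listed above can be made concrete with a toy in-memory store. Everything in this sketch (the ToyObjectStore class and its methods) is hypothetical and only mimics the behavior described. It is not how Airflow or fsspec implement storage access.

```python
class ToyObjectStore:
    """In-memory stand-in for a bucket: a flat mapping of key -> bytes."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def list_dir(self, prefix: str) -> list[str]:
        # "Directories" are emulated: scan every key and filter by prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename(self, src: str, dst: str) -> None:
        # No atomic rename: copy first, then delete the source.
        # A failure between the two steps leaves an intermediate state.
        self.objects[dst] = self.objects[src]
        del self.objects[src]


store = ToyObjectStore()
store.put("logs/2024/a.txt", b"a")
store.put("logs/2024/b.txt", b"b")
store.put("data/c.txt", b"c")

listing = store.list_dir("logs/2024/")
store.rename("data/c.txt", "logs/2024/c.txt")
after_rename = store.list_dir("logs/")
```

The cost of list_dir grows with the total number of keys, not with the size of the "directory", which is why listing can be slow on large buckets.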
Basic Use¶
To use object storage, you need to instantiate a Path (see below) object with the URI of the object you want to interact with. For example, to point to a bucket in s3, you would do the following:
from airflow.io.path import ObjectStoragePath
base = ObjectStoragePath("s3://aws_default@my-bucket/")
The username part of the URI is optional. It can alternatively be passed in as a separate keyword argument:
# Equivalent to the previous example.
base = ObjectStoragePath("s3://my-bucket/", conn_id="aws_default")
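To make the layout of such a URI explicit, the sketch below splits one with the standard library. It only illustrates which part plays which role. It is not a claim about how Airflow parses the URI internally, and the object key path/to/key.txt is a made-up example.

```python
from urllib.parse import urlsplit

parts = urlsplit("s3://aws_default@my-bucket/path/to/key.txt")

scheme = parts.scheme          # selects the storage backend
conn_id = parts.username       # optional Airflow connection id
bucket = parts.hostname        # bucket or container name
key = parts.path.lstrip("/")   # object key inside the bucket
```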
Listing file-objects:
@task
def list_files() -> list[ObjectStoragePath]:
    files = [f for f in base.iterdir() if f.is_file()]
    return files
Navigating inside a directory tree:
base = ObjectStoragePath("s3://my-bucket/")
subdir = base / "subdir"
# prints ObjectStoragePath("s3://my-bucket/subdir")
print(subdir)
Opening a file:
@task
def read_file(path: ObjectStoragePath) -> str:
    with path.open() as f:
        return f.read()
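Leveraging XCom, you can pass paths between tasks:
@task
def create(path: ObjectStoragePath) -> ObjectStoragePath:
    return path / "new_file.txt"


@task
def write_file(path: ObjectStoragePath, content: bytes):
    # The file is opened in binary mode, so content must be bytes.
    with path.open("wb") as f:
        f.write(content)


new_file = create(base)
write = write_file(new_file, b"data")
# Read the file back (with read_file from the previous example) once it is written.
read = read_file(new_file)
write >> read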
Configuration¶
In its basic use, the object storage abstraction does not require much configuration and relies upon the
standard Airflow connection mechanism. This means that you can use the conn_id argument to specify
the connection to use. Any settings defined on the connection are pushed down to the underlying implementation.
For example, if you are using s3, you can specify the aws_access_key_id and aws_secret_access_key
but also add extra arguments like endpoint_url to specify a custom endpoint.
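As a hedged illustration, one way to define such a connection is a JSON-serialized environment variable. The connection id my_s3, the credentials, and the endpoint below are all placeholders, and the sketch assumes the Amazon provider's convention of storing the access key id as the login and the secret access key as the password.

```python
import json
import os

# All values are placeholders (assumptions), including the connection id "my_s3".
connection = {
    "conn_type": "aws",
    "login": "my-access-key-id",          # used as aws_access_key_id
    "password": "my-secret-access-key",   # used as aws_secret_access_key
    "extra": {"endpoint_url": "http://localhost:9000"},
}

# Airflow can look up connections from environment variables
# named AIRFLOW_CONN_<CONN_ID in upper case>.
os.environ["AIRFLOW_CONN_MY_S3"] = json.dumps(connection)
```

With a connection like this in place, a path such as ObjectStoragePath("s3://my-bucket/", conn_id="my_s3") would be expected to use the custom endpoint.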
Alternative backends¶
It is possible to configure an alternative backend for a scheme or protocol. This is done by attaching
a backend to the scheme. For example, to enable the databricks backend for the dbfs scheme, you
would do the following:
from airflow.io.path import ObjectStoragePath
from airflow.io.store import attach
# fsspec exposes the Databricks filesystem under the name DatabricksFileSystem.
from fsspec.implementations.dbfs import DatabricksFileSystem
attach(protocol="dbfs", fs=DatabricksFileSystem(instance="myinstance", token="mytoken"))
base = ObjectStoragePath("dbfs://my-location/")
Note
To reuse the registration across tasks make sure to attach the backend at the top-level of your DAG. Otherwise, the backend will not be available across multiple tasks.
Path API¶
The object storage abstraction is implemented as a Path API and builds upon Universal Pathlib.
This means that you can mostly use
the same API to interact with object storage as you would with a local filesystem. In this section we only list the
differences between the two APIs. Extended operations beyond the standard Path API, like copying and moving, are listed
in the next section. For details about each operation, like what arguments they take, see the documentation of
the ObjectStoragePath class.
mkdir¶
Create a directory entry at the specified path or within a bucket/container. For systems that don’t have true directories, it may create a directory entry for this instance only and not affect the real filesystem.
If parents is True, any missing parents of this path are created as needed.
touch¶
Create a file at this given path, or update the timestamp. If truncate is True, the file is truncated, which is
the default. If the file already exists, the function succeeds if exist_ok is true (and its modification time is
updated to the current time), otherwise FileExistsError is raised.
stat¶
Returns a stat_result-like object that supports the following attributes: st_size, st_mtime, st_mode,
but also acts like a dictionary that can provide additional metadata about the object. For example, for s3 it will
return additional keys like ['ETag', 'ContentType']. If your code needs to be portable across different object
stores, do not rely on the extended metadata.
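A small sketch of the portable approach: the hypothetical helper portable_metadata reads only the three attributes listed above and ignores any store-specific keys. pathlib.Path is used as a local stand-in for ObjectStoragePath so that the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def portable_metadata(path) -> dict:
    """Return only the stat fields that are common to all stores."""
    info = path.stat()
    return {"size": info.st_size, "mtime": info.st_mtime, "mode": info.st_mode}


# pathlib.Path is a local stand-in for ObjectStoragePath (assumption).
sample = Path(tempfile.mkdtemp()) / "sample.txt"
sample.write_bytes(b"12345")
meta = portable_metadata(sample)
```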
Extensions¶
The following operations are not part of the standard Path API, but are supported by the object storage abstraction.
bucket¶
Returns the bucket name.
checksum¶
Returns the checksum of the file.
container¶
Alias of bucket.
fs¶
Convenience attribute to access an instantiated filesystem.
key¶
Returns the object key.
path¶
The fsspec-compatible path for use with filesystem instances.
protocol¶
The filesystem_spec protocol.
read_block¶
Read a block of bytes from the file at this given path.
Starting at offset of the file, read length bytes. If delimiter is set then we ensure that the read starts and stops at delimiter boundaries that follow the locations offset and offset + length. If offset is zero then we start at zero. The bytestring returned WILL include the end delimiter string.
If offset+length is beyond the eof, reads to eof.
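The delimiter semantics described above can be sketched on an in-memory byte string. This is an illustrative re-implementation of that description, not the code Airflow or fsspec actually run. The function name read_block_sketch and the sample data are hypothetical.

```python
from typing import Optional


def read_block_sketch(
    data: bytes, offset: int, length: int, delimiter: Optional[bytes] = None
) -> bytes:
    """Return a block of data, optionally aligned to delimiter boundaries."""
    if delimiter is None:
        # Slicing past the end simply reads to EOF.
        return data[offset:offset + length]

    if offset == 0:
        start = 0
    else:
        # Start right after the first delimiter found at or after offset.
        found = data.find(delimiter, offset)
        start = len(data) if found == -1 else found + len(delimiter)

    # Stop right after the first delimiter found at or after offset + length.
    found = data.find(delimiter, offset + length)
    end = len(data) if found == -1 else found + len(delimiter)
    return data[start:end]


records = b"Alice, 100\nBob, 200\nCharlie, 300"
raw = read_block_sketch(records, 0, 13)
first_two = read_block_sketch(records, 0, 13, delimiter=b"\n")
tail = read_block_sketch(records, 10, 10, delimiter=b"\n")
```

Aligning blocks to delimiters this way lets several workers read consecutive byte ranges of a line-oriented file without splitting a record between them.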
sign¶
Create a signed URL representing the given path. Some implementations allow temporary URLs to be generated, as a way of delegating credentials.
size¶
Returns the size in bytes of the file at the given path.
storage_options¶
The storage options for instantiating the underlying filesystem.
ukey¶
Hash of file properties, to tell if it has changed.
Copying and Moving¶
This documents the expected behavior of the copy and move operations, particularly for cross object store (e.g.
file -> s3) behavior. Each method copies or moves files or directories from a source to a target location.
The intended behavior is the same as specified by fsspec. For cross object store directory copying,
Airflow needs to walk the directory tree and copy each file individually. This is done by streaming each file from the
source to the target.
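The walk-and-stream approach described above can be sketched as follows. This is a simplified illustration, not Airflow's implementation. The name copy_tree and the chunk size are assumptions, and pathlib.Path stands in for source and target paths that would normally live in different object stores.

```python
import tempfile
from pathlib import Path


def copy_tree(src_root, dst_root, chunk_size: int = 64 * 1024) -> int:
    """Walk src_root and stream every file to the same relative path under dst_root."""
    copied = 0
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Stream in chunks so a large file is never held in memory at once.
        with src.open("rb") as fin, dst.open("wb") as fout:
            while chunk := fin.read(chunk_size):
                fout.write(chunk)
        copied += 1
    return copied


# Local directories stand in for two different stores (assumption).
root = Path(tempfile.mkdtemp())
(root / "src" / "nested").mkdir(parents=True)
(root / "src" / "a.txt").write_bytes(b"a")
(root / "src" / "nested" / "b.txt").write_bytes(b"b")
count = copy_tree(root / "src", root / "dst")
```

Because every file is transferred individually, the cost of a cross-store directory copy scales with both the number of files and their total size.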
External Integrations¶
Many other projects, like DuckDB and Apache Iceberg, can make use of the object storage abstraction. Often this is
done by passing the underlying fsspec implementation. For this purpose ObjectStoragePath exposes
the fs property. For example, the following works with duckdb so that the connection details from Airflow
are used to connect to s3, and a parquet file, indicated by an ObjectStoragePath, is read:
import duckdb
from airflow.io.path import ObjectStoragePath
path = ObjectStoragePath("s3://my-bucket/my-table.parquet", conn_id="aws_default")
conn = duckdb.connect(database=":memory:")
conn.register_filesystem(path.fs)
conn.execute(f"CREATE OR REPLACE TABLE my_table AS SELECT * FROM read_parquet('{path}');")