airflow.providers.apache.beam.operators.beam

This module contains Apache Beam operators.

Module Contents

Classes
- BeamDataflowMixin – Helper class to store common, Dataflow-specific logic for the Python, Java, and Go pipeline operators.
- BeamBasePipelineOperator – Abstract base class for Beam Pipeline Operators.
- BeamRunPythonPipelineOperator – Launch Apache Beam pipelines written in Python.
- BeamRunJavaPipelineOperator – Launch Apache Beam pipelines written in Java.
- BeamRunGoPipelineOperator – Launch Apache Beam pipelines written in Go.
- class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin[source]
  Helper class to store common, Dataflow-specific logic shared by BeamRunPythonPipelineOperator, BeamRunJavaPipelineOperator, and BeamRunGoPipelineOperator.
  - dataflow_hook: airflow.providers.google.cloud.hooks.dataflow.DataflowHook | None[source]
- class airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator(*, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', dataflow_config=None, **kwargs)[source]
  Bases: airflow.models.BaseOperator, BeamDataflowMixin, abc.ABC
  Abstract base class for Beam Pipeline Operators.
  - Parameters
    - runner (str) – Runner on which the pipeline will be run. "DirectRunner" is used by default. Other possible options: DataflowRunner, SparkRunner, FlinkRunner, PortableRunner. See BeamRunnerType and the capability matrix: https://beam.apache.org/documentation/runners/capability-matrix/
    - default_pipeline_options (dict | None) – Map of default pipeline options.
    - pipeline_options (dict | None) – Map of pipeline options. The parameter must be a dictionary; each value can be of a different type:
      - If the value is None, the single option --key (without a value) will be added.
      - If the value is False, the option will be skipped.
      - If the value is True, the single option --key (without a value) will be added.
      - If the value is a list, one option will be added per element. If the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be passed.
      - Other value types will be replaced with their Python textual representation.
      When defining labels (the labels option), you can also provide a dictionary. See the sketch after this parameter list for how the mapping translates to command-line flags.
    - gcp_conn_id (str) – Optional. The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.
    - dataflow_config (airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration | dict | None) – Dataflow's configuration, used when the runner type is set to DataflowRunner. Optional; defaults to None.
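  As a minimal sketch of the conversion rules above (every option name and value here is a hypothetical placeholder, not a required Beam option):

  ```python
  # Hypothetical pipeline_options mapping and the flags it produces,
  # following the per-type rules documented for pipeline_options.
  pipeline_options = {
      "project": "my-gcp-project",        # --project=my-gcp-project
      "streaming": True,                  # --streaming (flag only, no value)
      "save_main_session": None,          # --save_main_session (flag only, no value)
      "use_public_ips": False,            # skipped entirely
      "experiments": ["opt_a", "opt_b"],  # --experiments=opt_a --experiments=opt_b
      "labels": {"team": "data"},         # labels may also be provided as a dictionary
  }
  ```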
- class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, py_interpreter='python3', py_options=None, py_requirements=None, py_system_site_packages=False, gcp_conn_id='google_cloud_default', dataflow_config=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]
  Bases: BeamBasePipelineOperator
  Launch Apache Beam pipelines written in Python.
  Note that both default_pipeline_options and pipeline_options will be merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.

  See also
  For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam

  See also
  For more detail on Apache Beam, have a look at the reference: https://beam.apache.org/documentation/
  - Parameters
    - py_file (str) – Reference to the Python Apache Beam pipeline file (.py), e.g., /some/local/file/path/to/your/python/pipeline/file. (templated)
    - py_options (list[str] | None) – Additional Python options, e.g., ["-m", "-v"].
    - py_interpreter (str) – Python interpreter used to run the Beam pipeline. Defaults to python3. To track Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251
    - py_requirements (list[str] | None) – Additional Python package(s) to install. If a value is passed to this parameter, a new virtual environment is created with the additional packages installed. You can also install the apache_beam package if it is not installed on your system, or if you want to use a different version.
    - py_system_site_packages (bool) – Whether to include system_site_packages in your virtualenv. See the virtualenv documentation for more information. This option is only relevant if the py_requirements parameter is not None.
    - deferrable (bool) – Run the operator in deferrable mode: checks for the state using asynchronous calls.
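  A minimal, hypothetical sketch of using this operator with the default DirectRunner; the DAG id, file path, and option values are placeholders, and the merging of default_pipeline_options with pipeline_options follows the note above:

  ```python
  from datetime import datetime

  from airflow import DAG
  from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

  with DAG(dag_id="example_beam_python", start_date=datetime(2024, 1, 1), schedule=None) as dag:
      run_python_pipeline = BeamRunPythonPipelineOperator(
          task_id="run_python_pipeline",
          py_file="/files/pipelines/wordcount.py",          # local path, or a GCS object if gcp_conn_id is set
          default_pipeline_options={"output": "/tmp/out"},  # high-level options shared across Beam tasks
          pipeline_options={"temp_location": "/tmp/beam"},  # merged over the defaults
          py_requirements=["apache-beam==2.50.0"],          # installed into a fresh virtualenv
          py_system_site_packages=False,
      )
  ```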
- class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar, runner='DirectRunner', job_class=None, default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', dataflow_config=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]
  Bases: BeamBasePipelineOperator
  Launch Apache Beam pipelines written in Java.
  Note that both default_pipeline_options and pipeline_options will be merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level pipeline_options, for instance project and zone information, which apply to all Apache Beam operators in the DAG.

  See also
  For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

  See also
  For more detail on Apache Beam, have a look at the reference: https://beam.apache.org/documentation/
  You need to pass the path to your jar file as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass pipeline options to your job.

  - Parameters
    - jar (str) – The reference to a self-executing Jar file. (templated)
    - job_class (str | None) – The name of the Apache Beam pipeline class to be executed; it is often not the main class configured in the pipeline jar file.
    - deferrable (bool) – Run the operator in deferrable mode: checks for the state using asynchronous calls.
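  A minimal, hypothetical sketch of launching a self-executing Beam jar; the jar path, job class, and option values are placeholders:

  ```python
  from datetime import datetime

  from airflow import DAG
  from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator

  with DAG(dag_id="example_beam_java", start_date=datetime(2024, 1, 1), schedule=None) as dag:
      run_java_pipeline = BeamRunJavaPipelineOperator(
          task_id="run_java_pipeline",
          jar="/files/pipelines/wordcount-bundled.jar",  # must be a self-executing jar
          job_class="org.example.WordCount",             # optional; not necessarily the jar's main class
          pipeline_options={"output": "/tmp/java-out"},
      )
  ```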
- class airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator(*, go_file='', launcher_binary='', worker_binary='', runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', dataflow_config=None, **kwargs)[source]
  Bases: BeamBasePipelineOperator
  Launch Apache Beam pipelines written in Go.
  Note that both default_pipeline_options and pipeline_options will be merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.

  See also
  For more information on how to use this operator, take a look at the guide: Run Go Pipelines in Apache Beam

  See also
  For more detail on Apache Beam, have a look at the reference: https://beam.apache.org/documentation/
  - Parameters
    - go_file (str) – Reference to the Apache Beam pipeline Go source file, e.g., /local/path/to/main.go or gs://bucket/path/to/main.go. Exactly one of go_file and launcher_binary must be provided.
    - launcher_binary (str) – Reference to the Apache Beam pipeline Go binary compiled for the launching platform, e.g., /local/path/to/launcher-main or gs://bucket/path/to/launcher-main. Exactly one of go_file and launcher_binary must be provided.
    - worker_binary (str) – Reference to the Apache Beam pipeline Go binary compiled for the worker platform, e.g., /local/path/to/worker-main or gs://bucket/path/to/worker-main. Needed if the OS or architecture of the workers running the pipeline differs from that of the platform launching the pipeline. For more information, see the Apache Beam documentation on Go cross compilation: https://beam.apache.org/documentation/sdks/go-cross-compilation/. If launcher_binary is not set, providing a worker_binary has no effect. If launcher_binary is set and worker_binary is not, worker_binary defaults to the value of launcher_binary.
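  A minimal, hypothetical sketch of running a Go pipeline from source; the source path and option values are placeholders, and exactly one of go_file or launcher_binary must be set, per the parameter docs above:

  ```python
  from datetime import datetime

  from airflow import DAG
  from airflow.providers.apache.beam.operators.beam import BeamRunGoPipelineOperator

  with DAG(dag_id="example_beam_go", start_date=datetime(2024, 1, 1), schedule=None) as dag:
      run_go_pipeline = BeamRunGoPipelineOperator(
          task_id="run_go_pipeline",
          go_file="/files/pipelines/main.go",           # local path or gs:// URI; omit if using launcher_binary
          pipeline_options={"output": "/tmp/go-out"},
      )
  ```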