airflow.providers.apache.beam.operators.beam
This module contains Apache Beam operators.
Module Contents
Classes
- BeamDataflowMixin -- Helper class to store common, Dataflow-specific logic for the pipeline operators below.
- BeamBasePipelineOperator -- Abstract base class for Beam Pipeline Operators.
- BeamRunPythonPipelineOperator -- Launching Apache Beam pipelines written in Python.
- BeamRunJavaPipelineOperator -- Launching Apache Beam pipelines written in Java.
- BeamRunGoPipelineOperator -- Launching Apache Beam pipelines written in Go.
- class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin[source]
Helper class to store common, Dataflow-specific logic for BeamRunPythonPipelineOperator, BeamRunJavaPipelineOperator and BeamRunGoPipelineOperator.
- class airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator(*, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]
Bases: airflow.models.BaseOperator, BeamDataflowMixin, abc.ABC
Abstract base class for Beam Pipeline Operators.
- Parameters
runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. Other possible options: DataflowRunner, SparkRunner, FlinkRunner, PortableRunner. See: BeamRunnerType. See: https://beam.apache.org/documentation/runners/capability-matrix/
default_pipeline_options (Optional[dict]) -- Map of default pipeline options.
pipeline_options (Optional[dict]) --
Map of pipeline options. The key must be a string. The value can take different types (see the sketch after this parameter list):
- If the value is None, the single option --key (without a value) will be added.
- If the value is False, the option will be skipped.
- If the value is True, the single option --key (without a value) will be added.
- If the value is a list, one option will be added for each item: if the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be added.
- Other value types will be replaced with their Python textual representation.
When defining labels (the labels option), you can also provide a dictionary.
gcp_conn_id (str) -- Optional. The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.
delegate_to (Optional[str]) -- Optional. The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
dataflow_config (Optional[Union[airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration, dict]]) -- Dataflow configuration, used when runner type is set to DataflowRunner, (optional) defaults to None.
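Illustratively, here is how pipeline_options values would translate to command-line flags under the rules above. This is a hand-written sketch of the mapping, not the provider's actual rendering code, and the option names are placeholders:

pipeline_options = {
    "streaming": None,          # -> --streaming            (flag only)
    "save_main_session": True,  # -> --save_main_session    (flag only)
    "verbose": False,           # -> (skipped)
    "experiments": ["a", "b"],  # -> --experiments=a --experiments=b
    "num_workers": 2,           # -> --num_workers=2        (textual representation)
    "labels": {"team": "etl"},  # labels may also be given as a dictionary
}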
- class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, py_interpreter='python3', py_options=None, py_requirements=None, py_system_site_packages=False, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]
Bases: BeamBasePipelineOperator
Launching Apache Beam pipelines written in Python. Note that both default_pipeline_options and pipeline_options will be merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.
See also
For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam
See also
For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/
- Parameters
py_file (str) -- Reference to the Python Apache Beam pipeline .py file, e.g., /some/local/file/path/to/your/python/pipeline/file. (templated)
py_options (Optional[List[str]]) -- Additional python options, e.g., ["-m", "-v"].
py_interpreter (str) -- Python version of the Beam pipeline. If None, this defaults to python3. To track the Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251
py_requirements (Optional[List[str]]) --
Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed.
You could also install the apache_beam package if it is not installed on your system, or if you want to use a different version.
py_system_site_packages (bool) --
Whether to include system_site_packages in your virtualenv. See the virtualenv documentation for more information.
This option is only relevant if the py_requirements parameter is not None.
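A minimal sketch of using this operator in a DAG with the DirectRunner; the file path, output option, and pinned apache-beam version below are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

with DAG(
    dag_id="example_beam_python",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start_python_pipeline = BeamRunPythonPipelineOperator(
        task_id="start_python_pipeline",
        py_file="/files/wordcount.py",  # placeholder: path to your pipeline file
        runner="DirectRunner",
        pipeline_options={"output": "/tmp/wordcount_output"},  # placeholder option
        py_requirements=["apache-beam[gcp]==2.46.0"],  # placeholder pin; installed into a fresh virtualenv
        py_interpreter="python3",
        py_system_site_packages=False,
    )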
- class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar, runner='DirectRunner', job_class=None, default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]
Bases: BeamBasePipelineOperator
Launching Apache Beam pipelines written in Java.
Note that both default_pipeline_options and pipeline_options will be merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level pipeline_options, for instance project and zone information, which apply to all Apache Beam operators in the DAG.
See also
For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam
See also
For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/
You need to pass the path to your jar file as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass pipeline options on to your job.
- Parameters
jar (str) -- The reference to a self-executing Apache Beam jar file, as described above.
job_class (Optional[str]) -- The name of the Apache Beam pipeline class to be executed.
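A minimal sketch of launching a self-executing jar with the DirectRunner; the jar path and class name below are placeholders:

start_java_pipeline = BeamRunJavaPipelineOperator(
    task_id="start_java_pipeline",
    jar="/files/beam-examples-bundled.jar",  # placeholder: path to a self-executing jar
    job_class="org.apache.beam.examples.WordCount",  # placeholder: pipeline class to run
    runner="DirectRunner",
    pipeline_options={"output": "/tmp/wordcount_output"},  # placeholder option
)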
- class airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator(*, go_file, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]
Bases: BeamBasePipelineOperator
Launching Apache Beam pipelines written in Go. Note that both default_pipeline_options and pipeline_options will be merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.
See also
For more information on how to use this operator, take a look at the guide: Run Go Pipelines in Apache Beam
See also
For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/
- Parameters
go_file (str) -- Reference to the Go Apache Beam pipeline file, e.g., /some/local/file/path/to/your/go/pipeline/file.go
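A minimal sketch of running a Go pipeline source file with the DirectRunner; the file path below is a placeholder:

start_go_pipeline = BeamRunGoPipelineOperator(
    task_id="start_go_pipeline",
    go_file="/files/wordcount.go",  # placeholder: path to the Go pipeline source
    runner="DirectRunner",
    pipeline_options={"output": "/tmp/wordcount_output"},  # placeholder option
)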