airflow.providers.apache.beam.operators.beam

This module contains Apache Beam operators.

Module Contents

Classes

BeamDataflowMixin

Helper class to store common Dataflow-specific logic for BeamRunPythonPipelineOperator, BeamRunJavaPipelineOperator and BeamRunGoPipelineOperator.

BeamBasePipelineOperator

Abstract base class for Beam Pipeline Operators.

BeamRunPythonPipelineOperator

Launching Apache Beam pipelines written in Python.

BeamRunJavaPipelineOperator

Launching Apache Beam pipelines written in Java.

BeamRunGoPipelineOperator

Launching Apache Beam pipelines written in Go.

class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin[source]

Helper class to store common Dataflow-specific logic for BeamRunPythonPipelineOperator, BeamRunJavaPipelineOperator and BeamRunGoPipelineOperator.

dataflow_hook :Optional[airflow.providers.google.cloud.hooks.dataflow.DataflowHook][source]
dataflow_config :airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration[source]
gcp_conn_id :str[source]
delegate_to :Optional[str][source]
class airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator(*, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]

Bases: airflow.models.BaseOperator, BeamDataflowMixin, abc.ABC

Abstract base class for Beam Pipeline Operators.

Parameters
  • runner (str) -- Runner on which the pipeline will be run. "DirectRunner" is used by default. Other possible options: DataflowRunner, SparkRunner, FlinkRunner, PortableRunner. See BeamRunnerType and the capability matrix: https://beam.apache.org/documentation/runners/capability-matrix/

  • default_pipeline_options (Optional[dict]) -- Map of default pipeline options.

  • pipeline_options (Optional[dict]) --

    Map of pipeline options. The parameter must be a dictionary; each value can be of a different type (see the sketch after this parameter list):

    • If the value is None, the single option --key (without a value) will be added.

    • If the value is False, the option will be skipped.

    • If the value is True, the single option --key (without a value) will be added.

    • If the value is a list, one option is added per element. For example, if the value is ['A', 'B'] and the key is key, the options --key=A --key=B will be passed.

    • Values of other types are converted to their Python textual representation.

    When defining labels (labels option), you can also provide a dictionary.

  • gcp_conn_id (str) -- Optional. The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.

  • delegate_to (Optional[str]) -- Optional. The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • dataflow_config (Optional[Union[airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration, dict]]) -- Optional. Dataflow configuration, used when the runner is set to DataflowRunner. Defaults to None.
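
As a small sketch of the mapping described above (the option names are only examples, not options required by this operator), the following dictionary would be rendered as shown in the comments:

    # Illustrative only: how pipeline_options values are turned into command-line flags.
    pipeline_options = {
        "streaming": None,                    # -> --streaming            (flag without a value)
        "save_main_session": True,            # -> --save_main_session    (flag without a value)
        "dry_run": False,                     # skipped entirely
        "experiments": ["opt_a", "opt_b"],    # -> --experiments=opt_a --experiments=opt_b
        "num_workers": 2,                     # -> --num_workers=2        (textual representation)
        "labels": {"team": "data"},           # the labels option also accepts a dictionary
    }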

class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, py_interpreter='python3', py_options=None, py_requirements=None, py_system_site_packages=False, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]

Bases: BeamBasePipelineOperator

Launching Apache Beam pipelines written in Python. Note that both default_pipeline_options and pipeline_options are merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

Parameters
  • py_file (str) -- Reference to the Python Apache Beam pipeline file (.py), e.g., /some/local/file/path/to/your/python/pipeline/file. (templated)

  • py_options (Optional[List[str]]) -- Additional python options, e.g., ["-m", "-v"].

  • py_interpreter (str) -- Python version of the Beam pipeline. If None, this defaults to python3. To track the Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251

  • py_requirements (Optional[List[str]]) --

    Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed.

    You can also install the apache_beam package if it is not installed on your system, or if you want to use a different version.

  • py_system_site_packages (bool) --

    Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information.

    This option is only relevant if the py_requirements parameter is not None.
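
A minimal usage sketch, assuming a local pipeline file and placeholder option values (none of these paths or versions come from this module), running on the default DirectRunner with extra packages installed into a fresh virtual environment via py_requirements:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

    with DAG(
        dag_id="example_beam_python",          # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_python_pipeline = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline",
            py_file="/files/pipelines/wordcount.py",          # placeholder local path; GCS paths also work
            runner="DirectRunner",
            pipeline_options={"output": "/tmp/wordcount_out"},  # placeholder option
            py_requirements=["apache-beam[gcp]==2.40.0"],       # installed into a new virtualenv
            py_interpreter="python3",
            py_system_site_packages=False,
        )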

template_fields :Sequence[str] = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]

Override this method to clean up subprocesses when a task instance is killed. Any use of the threading, subprocess or multiprocessing modules within an operator needs to be cleaned up, or it will leave ghost processes behind.

class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar, runner='DirectRunner', job_class=None, default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]

Bases: BeamBasePipelineOperator

Launching Apache Beam pipelines written in Java.

Note that both default_pipeline_options and pipeline_options are merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level pipeline_options, for instance project and zone information, which apply to all Apache Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

You need to pass the path to your jar file as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass pipeline options on to your job.

Parameters
  • jar (str) -- The reference to a self-executing Apache Beam jar (templated).

  • job_class (Optional[str]) -- The name of the Apache Beam pipeline class to be executed; it is often not the main class configured in the pipeline jar file.
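
A minimal usage sketch, assuming a self-executing jar stored on GCS and placeholder project, bucket and class names (none of these values come from this module), running on the DataflowRunner with a DataflowConfiguration:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

    with DAG(
        dag_id="example_beam_java_dataflow",   # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_java_pipeline = BeamRunJavaPipelineOperator(
            task_id="run_java_pipeline",
            jar="gs://my-bucket/pipelines/wordcount-bundled.jar",   # placeholder path to a self-executing jar
            job_class="org.example.WordCount",                      # placeholder pipeline class
            runner="DataflowRunner",
            pipeline_options={"tempLocation": "gs://my-bucket/tmp/"},  # placeholder GCS path
            dataflow_config=DataflowConfiguration(
                job_name="run_java_pipeline",
                project_id="my-gcp-project",                        # placeholder project
                location="us-central1",
            ),
        )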

template_fields :Sequence[str] = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
ui_color = #0273d4[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]

Override this method to clean up subprocesses when a task instance is killed. Any use of the threading, subprocess or multiprocessing modules within an operator needs to be cleaned up, or it will leave ghost processes behind.

class airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator(*, go_file, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', delegate_to=None, dataflow_config=None, **kwargs)[source]

Bases: BeamBasePipelineOperator

Launching Apache Beam pipelines written in Go. Note that both default_pipeline_options and pipeline_options are merged to specify the pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Go Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

Parameters
  • go_file (str) -- Reference to the Go Apache Beam pipeline, e.g., /some/local/file/path/to/your/go/pipeline/file.go. (templated)
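
A minimal usage sketch, assuming a local Go source file and placeholder option values (not taken from this module), running on the default DirectRunner:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunGoPipelineOperator

    with DAG(
        dag_id="example_beam_go",              # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_go_pipeline = BeamRunGoPipelineOperator(
            task_id="run_go_pipeline",
            go_file="/files/pipelines/wordcount.go",             # placeholder local path
            runner="DirectRunner",
            pipeline_options={"output": "/tmp/wordcount_out"},   # placeholder option
        )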

template_fields = ['go_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]

Override this method to clean up subprocesses when a task instance is killed. Any use of the threading, subprocess or multiprocessing modules within an operator needs to be cleaned up, or it will leave ghost processes behind.
