airflow.providers.apache.beam.operators.beam

This module contains Apache Beam operators.

Module Contents

class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin[source]

Helper class to store common, Dataflow specific logic for both BeamRunPythonPipelineOperator and BeamRunJavaPipelineOperator.

dataflow_hook :Optional[DataflowHook][source]
dataflow_config :Optional[DataflowConfiguration][source]
class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file: str, runner: str = 'DirectRunner', default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, py_interpreter: str = 'python3', py_options: Optional[List[str]] = None, py_requirements: Optional[List[str]] = None, py_system_site_packages: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator, airflow.providers.apache.beam.operators.beam.BeamDataflowMixin

Launch Apache Beam pipelines written in Python. Note that default_pipeline_options and pipeline_options will be merged to build the pipeline execution parameters; default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

Parameters
  • py_file (str) -- Reference to the python Apache Beam pipeline file.py, e.g., /some/local/file/path/to/your/python/pipeline/file. (templated)

  • runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. Other possible options: DataflowRunner, SparkRunner, FlinkRunner. See: BeamRunnerType. See: https://beam.apache.org/documentation/runners/capability-matrix/

  • py_options (list[str]) -- Additional python options, e.g., ["-m", "-v"].

  • default_pipeline_options (dict) -- Map of default pipeline options.

  • pipeline_options (dict) --

    Map of pipeline options. The parameter must be a dictionary; each key becomes a command-line option, rendered according to the value type:

    • If the value is None, the single option --key (without a value) will be added.

    • If the value is False, this option will be skipped.

    • If the value is True, the single option --key (without a value) will be added.

    • If the value is a list, one option will be added for each element. If the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be added.

    • Other value types will be rendered using their Python textual representation.

    When defining labels (labels option), you can also provide a dictionary.

  • py_interpreter (str) -- Python version of the Beam pipeline. Defaults to python3. To track Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251

  • py_requirements (List[str]) --

    Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed, and the pipeline will run in it.

    You could also install the apache_beam package if it is not installed on your system or you want to use a different version.

  • py_system_site_packages (bool) --

    Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information.

    This option is only relevant if the py_requirements parameter is not None.

  • gcp_conn_id (str) -- Optional. The connection ID to use when connecting to Google Cloud Storage if the python file is on GCS.

  • delegate_to (str) -- Optional. The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]) -- Dataflow configuration, used when the runner is set to DataflowRunner.
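
The following is a minimal usage sketch, not taken from this module's bundled examples; the DAG id, file paths, bucket, project, and version pins are placeholders. It illustrates how default_pipeline_options and pipeline_options combine, how option values are rendered as command-line flags, and how a DataflowConfiguration can be attached when the runner is DataflowRunner:

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration
    from airflow.utils.dates import days_ago

    with DAG(
        dag_id="example_beam_python",  # placeholder DAG id
        start_date=days_ago(1),
        schedule_interval=None,
    ) as dag:
        # Local run with the DirectRunner. pipeline_options values are rendered
        # as command-line flags, e.g. {"output": "/tmp/out", "verbose": True}
        # becomes --output=/tmp/out --verbose
        run_direct = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline_direct",
            py_file="/files/pipelines/wordcount.py",  # placeholder local path
            runner="DirectRunner",
            pipeline_options={"output": "/tmp/wordcount_out", "verbose": True},
            py_requirements=["apache-beam[gcp]==2.26.0"],  # installed into a fresh virtualenv
            py_interpreter="python3",
            py_system_site_packages=False,
        )

        # Dataflow run. default_pipeline_options carries project-level settings
        # shared by all Beam operators in the DAG; dataflow_config is only used
        # because the runner is DataflowRunner.
        run_dataflow = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline_dataflow",
            py_file="gs://my-bucket/pipelines/wordcount.py",  # placeholder GCS path
            runner="DataflowRunner",
            default_pipeline_options={"project": "my-gcp-project", "region": "us-central1"},
            pipeline_options={"temp_location": "gs://my-bucket/tmp/"},
            dataflow_config=DataflowConfiguration(
                job_name="{{ task.task_id }}",
                location="us-central1",
            ),
        )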

template_fields = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]
class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar: str, runner: str = 'DirectRunner', job_class: Optional[str] = None, default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator, airflow.providers.apache.beam.operators.beam.BeamDataflowMixin

Launching Apache Beam pipelines written in Java.

Note that default_pipeline_options and pipeline_options will be merged to build the pipeline execution parameters; default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Apache Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

You need to pass the path to your jar file as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass job-specific options to your pipeline.

Parameters
  • jar (str) -- The reference to a self executing Apache Beam jar (templated).

  • runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. See: https://beam.apache.org/documentation/runners/capability-matrix/

  • job_class (str) -- The name of the Apache Beam pipeline class to be executed; it is often not the main class configured in the pipeline jar file.

  • default_pipeline_options (dict) -- Map of default job pipeline_options.

  • pipeline_options (dict) --

    Map of job-specific pipeline options. The parameter must be a dictionary; each key becomes a command-line option, rendered according to the value type:

    • If the value is None, the single option --key (without a value) will be added.

    • If the value is False, this option will be skipped.

    • If the value is True, the single option --key (without a value) will be added.

    • If the value is a list, one option will be added for each element. If the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be added.

    • Other value types will be rendered using their Python textual representation.

    When defining labels (labels option), you can also provide a dictionary.

  • gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud Storage if the jar is on GCS.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]) -- Dataflow configuration, used when the runner is set to DataflowRunner.
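
A minimal usage sketch along the same lines as the Python example above; the DAG id, jar location, and pipeline class name are placeholders, and the jar is assumed to be a self-executing jar stored on GCS (fetched via gcp_conn_id):

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator
    from airflow.utils.dates import days_ago

    with DAG(
        dag_id="example_beam_java",  # placeholder DAG id
        start_date=days_ago(1),
        schedule_interval=None,
    ) as dag:
        # Runs a self-executing jar locally with the DirectRunner. Because the
        # jar lives on GCS, it is first downloaded using gcp_conn_id.
        run_java_direct = BeamRunJavaPipelineOperator(
            task_id="run_java_pipeline_direct",
            jar="gs://my-bucket/pipelines/wordcount-bundled.jar",  # placeholder GCS path
            runner="DirectRunner",
            job_class="org.example.WordCount",  # placeholder pipeline class
            pipeline_options={"output": "/tmp/wordcount_java_out"},
        )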

template_fields = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
ui_color = '#0273d4'[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]