airflow.providers.apache.beam.hooks.beam

This module contains a Apache Beam Hook.

Module Contents

class airflow.providers.apache.beam.hooks.beam.BeamRunnerType[source]

Helper class for listing runner types. For more information about runners see: https://beam.apache.org/documentation/

DataflowRunner = DataflowRunner[source]
DirectRunner = DirectRunner[source]
SparkRunner = SparkRunner[source]
FlinkRunner = FlinkRunner[source]
SamzaRunner = SamzaRunner[source]
NemoRunner = NemoRunner[source]
JetRunner = JetRunner[source]
Twister2Runner = Twister2Runner[source]
airflow.providers.apache.beam.hooks.beam.beam_options_to_args(options: dict) → List[str][source]
Returns a formatted pipeline options from a dictionary of arguments

The logic of this method should be compatible with Apache Beam: https://github.com/apache/beam/blob/b56740f0e8cd80c2873412847d0b336837429fb9/sdks/python/ apache_beam/options/pipeline_options.py#L230-L251

Parameters

options (dict) -- Dictionary with options

Returns

List of arguments

Return type

List[str]

class airflow.providers.apache.beam.hooks.beam.BeamCommandRunner(cmd: List[str], process_line_callback: Optional[Callable[[str], None]] = None)[source]

Bases: airflow.utils.log.logging_mixin.LoggingMixin

Class responsible for running pipeline command in subprocess

Parameters
  • cmd (List[str]) -- Parts of the command to be run in subprocess

  • process_line_callback (Optional[Callable[[str], None]]) -- Optional callback which can be used to process stdout and stderr to detect job id

wait_for_done(self)[source]

Waits for Apache Beam pipeline to complete.

class airflow.providers.apache.beam.hooks.beam.BeamHook(runner: str)[source]

Bases: airflow.hooks.base.BaseHook

Hook for Apache Beam.

All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.

Parameters

runner (str) -- Runner type

start_python_pipeline(self, variables: dict, py_file: str, py_options: List[str], py_interpreter: str = 'python3', py_requirements: Optional[List[str]] = None, py_system_site_packages: bool = False, process_line_callback: Optional[Callable[[str], None]] = None)[source]

Starts Apache Beam python pipeline.

Parameters
  • variables (Dict) -- Variables passed to the pipeline.

  • py_options (List[str]) -- Additional options.

  • py_interpreter (str) -- Python version of the Apache Beam pipeline. If None, this defaults to the python3. To track python versions supported by beam and related issues check: https://issues.apache.org/jira/browse/BEAM-1251

  • py_requirements (List[str]) --

    Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment has been created with additional packages installed.

    You could also install the apache-beam package if it is not installed on your system or you want to use a different version.

  • py_system_site_packages (bool) --

    Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information.

    This option is only relevant if the py_requirements parameter is not None.

  • on_new_job_id_callback (callable) -- Callback called when the job ID is known.

start_java_pipeline(self, variables: dict, jar: str, job_class: Optional[str] = None, process_line_callback: Optional[Callable[[str], None]] = None)[source]

Starts Apache Beam Java pipeline.

Parameters
  • variables (dict) -- Variables passed to the job.

  • jar -- Name of the jar for the pipeline

  • job_class (str) -- Name of the java class for the pipeline.

Was this entry helpful?