airflow.providers.apache.beam.hooks.beam

This module contains a Apache Beam Hook.

Module Contents

Classes

BeamRunnerType

Helper class for listing runner types.

BeamHook

Hook for Apache Beam.

BeamAsyncHook

Asynchronous hook for Apache Beam.

Functions

beam_options_to_args(options)

Return a formatted pipeline options from a dictionary of arguments.

process_fd(proc, fd, log[, process_line_callback, ...])

Print output to logs.

run_beam_command(cmd, log[, process_line_callback, ...])

Run pipeline command in subprocess.

class airflow.providers.apache.beam.hooks.beam.BeamRunnerType[source]

Helper class for listing runner types.

For more information about runners see: https://beam.apache.org/documentation/

DataflowRunner = 'DataflowRunner'[source]
DirectRunner = 'DirectRunner'[source]
SparkRunner = 'SparkRunner'[source]
FlinkRunner = 'FlinkRunner'[source]
SamzaRunner = 'SamzaRunner'[source]
NemoRunner = 'NemoRunner'[source]
JetRunner = 'JetRunner'[source]
Twister2Runner = 'Twister2Runner'[source]
airflow.providers.apache.beam.hooks.beam.beam_options_to_args(options)[source]

Return a formatted pipeline options from a dictionary of arguments.

The logic of this method should be compatible with Apache Beam: https://github.com/apache/beam/blob/b56740f0e8cd80c2873412847d0b336837429fb9/sdks/python/ apache_beam/options/pipeline_options.py#L230-L251

Parameters

options (dict) – Dictionary with options

Returns

List of arguments

Return type

list[str]

airflow.providers.apache.beam.hooks.beam.process_fd(proc, fd, log, process_line_callback=None, check_job_status_callback=None)[source]

Print output to logs.

Parameters
  • proc – subprocess.

  • fd – File descriptor.

  • process_line_callback (Callable[[str], None] | None) – Optional callback which can be used to process stdout and stderr to detect job id.

  • log (logging.Logger) – logger.

airflow.providers.apache.beam.hooks.beam.run_beam_command(cmd, log, process_line_callback=None, working_directory=None, check_job_status_callback=None)[source]

Run pipeline command in subprocess.

Parameters
  • cmd (list[str]) – Parts of the command to be run in subprocess

  • process_line_callback (Callable[[str], None] | None) – Optional callback which can be used to process stdout and stderr to detect job id

  • working_directory (str | None) – Working directory

  • log (logging.Logger) – logger.

class airflow.providers.apache.beam.hooks.beam.BeamHook(runner)[source]

Bases: airflow.hooks.base.BaseHook

Hook for Apache Beam.

All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.

Parameters

runner (str) – Runner type

start_python_pipeline(variables, py_file, py_options, py_interpreter='python3', py_requirements=None, py_system_site_packages=False, process_line_callback=None, check_job_status_callback=None)[source]

Start Apache Beam python pipeline.

Parameters
  • variables (dict) – Variables passed to the pipeline.

  • py_file (str) – Path to the python file to execute.

  • py_options (list[str]) – Additional options.

  • py_interpreter (str) – Python version of the Apache Beam pipeline. If None, this defaults to the python3. To track python versions supported by beam and related issues check: https://issues.apache.org/jira/browse/BEAM-1251

  • py_requirements (list[str] | None) –

    Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment has been created with additional packages installed.

    You could also install the apache-beam package if it is not installed on your system, or you want to use a different version.

  • py_system_site_packages (bool) –

    Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information.

    This option is only relevant if the py_requirements parameter is not None.

  • process_line_callback (Callable[[str], None] | None) – (optional) Callback that can be used to process each line of the stdout and stderr file descriptors.

start_java_pipeline(variables, jar, job_class=None, process_line_callback=None)[source]

Start Apache Beam Java pipeline.

Parameters
  • variables (dict) – Variables passed to the job.

  • jar (str) – Name of the jar for the pipeline

  • job_class (str | None) – Name of the java class for the pipeline.

  • process_line_callback (Callable[[str], None] | None) – (optional) Callback that can be used to process each line of the stdout and stderr file descriptors.

start_go_pipeline(variables, go_file, process_line_callback=None, should_init_module=False)[source]

Start Apache Beam Go pipeline with a source file.

Parameters
  • variables (dict) – Variables passed to the job.

  • go_file (str) – Path to the Go file with your beam pipeline.

  • process_line_callback (Callable[[str], None] | None) – (optional) Callback that can be used to process each line of the stdout and stderr file descriptors.

  • should_init_module (bool) – If False (default), will just execute a go run command. If True, will init a module and dependencies with a go mod init and go mod tidy, useful when pulling source with GCSHook.

Returns

Return type

None

start_go_pipeline_with_binary(variables, launcher_binary, worker_binary, process_line_callback=None)[source]

Start Apache Beam Go pipeline with an executable binary.

Parameters
  • variables (dict) – Variables passed to the job.

  • launcher_binary (str) – Path to the binary compiled for the launching platform.

  • worker_binary (str) – Path to the binary compiled for the worker platform.

  • process_line_callback (Callable[[str], None] | None) – (optional) Callback that can be used to process each line of the stdout and stderr file descriptors.

class airflow.providers.apache.beam.hooks.beam.BeamAsyncHook(runner)[source]

Bases: BeamHook

Asynchronous hook for Apache Beam.

Parameters

runner (str) – Runner type.

async start_python_pipeline_async(variables, py_file, py_options=None, py_interpreter='python3', py_requirements=None, py_system_site_packages=False)[source]

Start Apache Beam python pipeline.

Parameters
  • variables (dict) – Variables passed to the pipeline.

  • py_file (str) – Path to the python file to execute.

  • py_options (list[str] | None) – Additional options.

  • py_interpreter (str) – Python version of the Apache Beam pipeline. If None, this defaults to the python3. To track python versions supported by beam and related issues check: https://issues.apache.org/jira/browse/BEAM-1251

  • py_requirements (list[str] | None) – Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment has been created with additional packages installed. You could also install the apache-beam package if it is not installed on your system, or you want to use a different version.

  • py_system_site_packages (bool) – Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information. This option is only relevant if the py_requirements parameter is not None.

async start_java_pipeline_async(variables, jar, job_class=None)[source]

Start Apache Beam Java pipeline.

Parameters
  • variables (dict) – Variables passed to the job.

  • jar (str) – Name of the jar for the pipeline.

  • job_class (str | None) – Name of the java class for the pipeline.

Returns

Beam command execution return code.

async start_pipeline_async(variables, command_prefix, working_directory=None)[source]
async run_beam_command_async(cmd, log, working_directory=None)[source]

Run pipeline command in subprocess.

Parameters
  • cmd (list[str]) – Parts of the command to be run in subprocess

  • working_directory (str | None) – Working directory

  • log (logging.Logger) – logger.

async read_logs(stream_reader)[source]

Was this entry helpful?