airflow.providers.apache.beam.operators.beam

This module contains Apache Beam operators.

Module Contents

class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin[source]

Helper class to store common, Dataflow specific logic for both BeamRunPythonPipelineOperator and BeamRunJavaPipelineOperator.

dataflow_hook :Optional[DataflowHook][source]
dataflow_config :Optional[DataflowConfiguration][source]
class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file: str, runner: str = 'DirectRunner', default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, py_interpreter: str = 'python3', py_options: Optional[List[str]] = None, py_requirements: Optional[List[str]] = None, py_system_site_packages: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator, airflow.providers.apache.beam.operators.beam.BeamDataflowMixin

Launch Apache Beam pipelines written in Python. Note that default_pipeline_options and pipeline_options will be merged to build the pipeline execution parameters; default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

Parameters
  • py_file (str) -- Reference to the python Apache Beam pipeline file.py, e.g., /some/local/file/path/to/your/python/pipeline/file. (templated)

  • runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. Other possible options: DataflowRunner, SparkRunner, FlinkRunner. See: BeamRunnerType. See: https://beam.apache.org/documentation/runners/capability-matrix/

  • py_options (list[str]) -- Additional python options, e.g., ["-m", "-v"].

  • default_pipeline_options (dict) -- Map of default pipeline options.

  • pipeline_options (dict) --

    Map of pipeline options. The parameter must be a dictionary; each key becomes a command-line option, rendered according to the value type:

    • If the value is None, the single option --key (without a value) will be added.

    • If the value is False, this option will be skipped.

    • If the value is True, the single option --key (without a value) will be added.

    • If the value is a list, one option will be added for each element. If the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be added.

    • Other value types will be rendered using their Python textual representation.

    When defining labels (labels option), you can also provide a dictionary.

  • py_interpreter (str) -- Python version of the Beam pipeline. Defaults to python3. To track Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251

  • py_requirements (List[str]) --

    Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed, and the pipeline will run in it.

    You could also install the apache_beam package if it is not installed on your system or you want to use a different version.

  • py_system_site_packages (bool) --

    Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information.

    This option is only relevant if the py_requirements parameter is not None.

  • gcp_conn_id (str) -- Optional. The connection ID to use when connecting to Google Cloud Storage if the python file is on GCS.

  • delegate_to (str) -- Optional. The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]) -- Dataflow configuration, used when the runner is set to DataflowRunner.
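
The following is a minimal usage sketch, not taken from this module's bundled examples; the DAG id, file paths, bucket, project, and version pins are placeholders. It illustrates how default_pipeline_options and pipeline_options combine, how option values are rendered as command-line flags, and how a DataflowConfiguration can be attached when the runner is DataflowRunner:

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration
    from airflow.utils.dates import days_ago

    with DAG(
        dag_id="example_beam_python",  # placeholder DAG id
        start_date=days_ago(1),
        schedule_interval=None,
    ) as dag:
        # Local run with the DirectRunner. pipeline_options values are rendered
        # as command-line flags, e.g. {"output": "/tmp/out", "verbose": True}
        # becomes --output=/tmp/out --verbose
        run_direct = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline_direct",
            py_file="/files/pipelines/wordcount.py",  # placeholder local path
            runner="DirectRunner",
            pipeline_options={"output": "/tmp/wordcount_out", "verbose": True},
            py_requirements=["apache-beam[gcp]==2.26.0"],  # installed into a fresh virtualenv
            py_interpreter="python3",
            py_system_site_packages=False,
        )

        # Dataflow run. default_pipeline_options carries project-level settings
        # shared by all Beam operators in the DAG; dataflow_config is only used
        # because the runner is DataflowRunner.
        run_dataflow = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline_dataflow",
            py_file="gs://my-bucket/pipelines/wordcount.py",  # placeholder GCS path
            runner="DataflowRunner",
            default_pipeline_options={"project": "my-gcp-project", "region": "us-central1"},
            pipeline_options={"temp_location": "gs://my-bucket/tmp/"},
            dataflow_config=DataflowConfiguration(
                job_name="{{ task.task_id }}",
                location="us-central1",
            ),
        )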

template_fields = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]
class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar: str, runner: str = 'DirectRunner', job_class: Optional[str] = None, default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator, airflow.providers.apache.beam.operators.beam.BeamDataflowMixin

Launching Apache Beam pipelines written in Java.

Note that default_pipeline_options and pipeline_options will be merged to build the pipeline execution parameters; default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Apache Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

You need to pass the path to your jar file as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass job-specific options to your pipeline.

Parameters
  • jar (str) -- The reference to a self executing Apache Beam jar (templated).

  • runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. See: https://beam.apache.org/documentation/runners/capability-matrix/

  • job_class (str) -- The name of the Apache Beam pipeline class to be executed; it is often not the main class configured in the pipeline jar file.

  • default_pipeline_options (dict) -- Map of default job pipeline_options.

  • pipeline_options (dict) --

    Map of job-specific pipeline options. The parameter must be a dictionary; each key becomes a command-line option, rendered according to the value type:

    • If the value is None, the single option --key (without a value) will be added.

    • If the value is False, this option will be skipped.

    • If the value is True, the single option --key (without a value) will be added.

    • If the value is a list, one option will be added for each element. If the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be added.

    • Other value types will be rendered using their Python textual representation.

    When defining labels (labels option), you can also provide a dictionary.

  • gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud Storage if the jar is on GCS.

  • delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]) -- Dataflow configuration, used when the runner is set to DataflowRunner.
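
A minimal usage sketch along the same lines as the Python example above; the DAG id, jar location, and pipeline class name are placeholders, and the jar is assumed to be a self-executing jar stored on GCS (fetched via gcp_conn_id):

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator
    from airflow.utils.dates import days_ago

    with DAG(
        dag_id="example_beam_java",  # placeholder DAG id
        start_date=days_ago(1),
        schedule_interval=None,
    ) as dag:
        # Runs a self-executing jar locally with the DirectRunner. Because the
        # jar lives on GCS, it is first downloaded using gcp_conn_id.
        run_java_direct = BeamRunJavaPipelineOperator(
            task_id="run_java_pipeline_direct",
            jar="gs://my-bucket/pipelines/wordcount-bundled.jar",  # placeholder GCS path
            runner="DirectRunner",
            job_class="org.example.WordCount",  # placeholder pipeline class
            pipeline_options={"output": "/tmp/wordcount_java_out"},
        )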

template_fields = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config'][source]
template_fields_renderers[source]
ui_color = '#0273d4'[source]
execute(self, context)[source]

Execute the Apache Beam Pipeline.

on_kill(self)[source]