airflow.providers.apache.beam.operators.beam
This module contains Apache Beam operators.
Module Contents
class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin

    Helper class to store common, Dataflow-specific logic for both BeamRunPythonPipelineOperator and BeamRunJavaPipelineOperator.
class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file: str, runner: str = 'DirectRunner', default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, py_interpreter: str = 'python3', py_options: Optional[List[str]] = None, py_requirements: Optional[List[str]] = None, py_system_site_packages: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)

    Bases: airflow.models.BaseOperator, airflow.providers.apache.beam.operators.beam.BeamDataflowMixin
    Launches Apache Beam pipelines written in Python. Note that both default_pipeline_options and pipeline_options will be merged to specify pipeline execution parameters, and default_pipeline_options is expected to hold high-level options, for instance, project and zone information, which apply to all Beam operators in the DAG.

    See also: For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam

    See also: For more detail on Apache Beam, have a look at the reference: https://beam.apache.org/documentation/
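    A minimal usage sketch: the DAG id, file path, and option values below are hypothetical, and the comments restate the merge behaviour described above.

        from datetime import datetime

        from airflow import DAG
        from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

        with DAG(
            dag_id="beam_python_example",  # hypothetical DAG id
            start_date=datetime(2021, 1, 1),
            schedule_interval=None,
        ) as dag:
            run_python_pipeline = BeamRunPythonPipelineOperator(
                task_id="run_python_pipeline",
                py_file="/files/wordcount.py",  # hypothetical local path (templated)
                runner="DirectRunner",  # the default
                # High-level options expected to apply to all Beam operators in the DAG.
                default_pipeline_options={"project": "my-project"},
                # Task-specific options; merged with default_pipeline_options before launch.
                pipeline_options={"output": "/tmp/wordcount_output"},
            )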
    Parameters:

        py_file (str) -- Reference to the Python Apache Beam pipeline file (.py), e.g. /some/local/file/path/to/your/python/pipeline/file. (templated)

        runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. Other possible options: DataflowRunner, SparkRunner, FlinkRunner. See BeamRunnerType and https://beam.apache.org/documentation/runners/capability-matrix/

        py_options (list[str]) -- Additional Python options, e.g. ["-m", "-v"].

        default_pipeline_options (dict) -- Map of default pipeline options.

        pipeline_options (dict) -- Map of pipeline options. This must be a dictionary; its values may be of different types (see the sketch after this list):

            - If the value is None, the single option --key (without a value) will be added.
            - If the value is False, this option will be skipped.
            - If the value is True, the single option --key (without a value) will be added.
            - If the value is a list, the option will be repeated for each element: if the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be passed.
            - Other value types will be replaced with their Python textual representation.

            When defining labels (the labels option), you can also provide a dictionary.

        py_interpreter (str) -- Python version of the Beam pipeline. If None, this defaults to python3. To track Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251

        py_requirements (List[str]) -- Additional Python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed. You could also install the apache_beam package if it is not installed on your system, or if you want to use a different version.

        py_system_site_packages (bool) -- Whether to include system_site_packages in your virtualenv. See the virtualenv documentation for more information. This option is only relevant if the py_requirements parameter is not None.

        gcp_conn_id (str) -- Optional. The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.

        delegate_to (str) -- Optional. The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

        dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]) -- Dataflow configuration, used when the runner type is set to DataflowRunner.
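    An illustrative sketch of the pipeline_options value rules listed above. The option names are made up for the example, and the loop merely reproduces the documented mapping in plain Python; it is not the provider's internal code.

        pipeline_options = {
            "project": "my-project",            # plain value:  --project=my-project
            "streaming": True,                  # True:         --streaming (no value)
            "no_use_public_ips": None,          # None:         --no_use_public_ips (no value)
            "experiments": ["opt_a", "opt_b"],  # list:         --experiments=opt_a --experiments=opt_b
            "dry_run": False,                   # False:        skipped entirely
        }

        args = []
        for key, value in pipeline_options.items():
            if value is False:
                continue  # False values are skipped
            if value is True or value is None:
                args.append(f"--{key}")  # bare flag without a value
            elif isinstance(value, list):
                args.extend(f"--{key}={item}" for item in value)  # one flag per element
            else:
                args.append(f"--{key}={value}")  # Python textual representation

        print(" ".join(args))
        # --project=my-project --streaming --no_use_public_ips --experiments=opt_a --experiments=opt_b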
class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar: str, runner: str = 'DirectRunner', job_class: Optional[str] = None, default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)

    Bases: airflow.models.BaseOperator, airflow.providers.apache.beam.operators.beam.BeamDataflowMixin
    Launches Apache Beam pipelines written in Java. Note that both default_pipeline_options and pipeline_options will be merged to specify pipeline execution parameters, and default_pipeline_options is expected to hold high-level pipeline_options, for instance, project and zone information, which apply to all Apache Beam operators in the DAG.

    See also: For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

    See also: For more detail on Apache Beam, have a look at the reference: https://beam.apache.org/documentation/

    You need to pass the path to your jar file as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass pipeline_options on to your job.

    Parameters:
        jar (str) -- The reference to a self-executing Apache Beam jar. (templated)

        runner (str) -- Runner on which the pipeline will be run. By default "DirectRunner" is used. See: https://beam.apache.org/documentation/runners/capability-matrix/

        job_class (str) -- The name of the Apache Beam pipeline class to be executed; it is often not the main class configured in the pipeline jar file.

        default_pipeline_options (dict) -- Map of default job pipeline_options.

        pipeline_options (dict) -- Map of job-specific pipeline_options. This must be a dictionary; its values may be of different types (the same rules as for the Python operator above apply):

            - If the value is None, the single option --key (without a value) will be added.
            - If the value is False, this option will be skipped.
            - If the value is True, the single option --key (without a value) will be added.
            - If the value is a list, the option will be repeated for each element: if the value is ['A', 'B'] and the key is key, then the options --key=A --key=B will be passed.
            - Other value types will be replaced with their Python textual representation.

            When defining labels (the labels option), you can also provide a dictionary.

        gcp_conn_id (str) -- The connection ID to use when connecting to Google Cloud Storage if the jar is on GCS.

        delegate_to (str) -- The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

        dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]) -- Dataflow configuration, used when the runner type is set to DataflowRunner. (See the usage sketch after this list.)
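    A minimal usage sketch for running a Java pipeline on Dataflow. The jar path, pipeline class, bucket, and DAG id are hypothetical, and the DataflowConfiguration arguments shown are assumptions for illustration; dataflow_config is consulted only because runner is set to DataflowRunner.

        from datetime import datetime

        from airflow import DAG
        from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator
        from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

        with DAG(
            dag_id="beam_java_example",  # hypothetical DAG id
            start_date=datetime(2021, 1, 1),
            schedule_interval=None,
        ) as dag:
            run_java_pipeline = BeamRunJavaPipelineOperator(
                task_id="run_java_pipeline",
                jar="gs://my-bucket/wordcount-bundled.jar",  # hypothetical self-executing jar (templated)
                job_class="org.example.WordCount",  # hypothetical; often not the jar's main class
                runner="DataflowRunner",
                default_pipeline_options={"project": "my-project"},
                pipeline_options={"output": "gs://my-bucket/wordcount/output"},
                # Used only because runner is DataflowRunner; a plain dict is also accepted.
                dataflow_config=DataflowConfiguration(job_name="wordcount-java", location="us-central1"),
            )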