`airflow.providers.google.cloud.operators.dataflow`¶

This module contains Google Dataflow operators.

Module Contents¶

Classes¶

`CheckJobRunning`	Helper enum for choosing what to do if job is already running.
`DataflowConfiguration`	Dataflow configuration for BeamRunJavaPipelineOperator and BeamRunPythonPipelineOperator.
`DataflowTemplatedJobStartOperator`	Start a Dataflow job with a classic template; the parameters of the operation will be passed to the job.
`DataflowStartFlexTemplateOperator`	Starts a Dataflow Job with a Flex Template.
`DataflowStartSqlJobOperator`	Starts Dataflow SQL query.
`DataflowStartYamlJobOperator`	Launch a Dataflow YAML job and return the result.
`DataflowStopJobOperator`	Stops the job with the specified name prefix or Job ID.
`DataflowCreatePipelineOperator`	Creates a new Dataflow Data Pipeline instance.
`DataflowRunPipelineOperator`	Runs a Dataflow Data Pipeline.
`DataflowDeletePipelineOperator`	Deletes a Dataflow Data Pipeline.

class airflow.providers.google.cloud.operators.dataflow.CheckJobRunning[source]¶

Bases: enum.Enum

Helper enum for choosing what to do if job is already running.

IgnoreJob - do not check if running FinishIfRunning - finish current dag run with no action WaitForRun - wait for job to finish and then continue with new job

IgnoreJob = 1[source]¶

FinishIfRunning = 2[source]¶

WaitForRun = 3[source]¶

class airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration(*, job_name=None, append_job_name=True, project_id=PROVIDE_PROJECT_ID, location=DEFAULT_DATAFLOW_LOCATION, gcp_conn_id='google_cloud_default', poll_sleep=10, impersonation_chain=None, drain_pipeline=False, cancel_timeout=5 * 60, wait_until_finished=None, multiple_jobs=None, check_if_running=CheckJobRunning.WaitForRun, service_account=None)[source]¶

Dataflow configuration for BeamRunJavaPipelineOperator and BeamRunPythonPipelineOperator.

Parameters

job_name (str | None) – The ‘jobName’ to use when executing the Dataflow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' or 'job_name'``in ``options will be overwritten.
append_job_name (bool) – True if unique suffix has to be appended to job name.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
location (str | None) – Job location.
gcp_conn_id (str) – The connection ID to use connecting to Google Cloud.
poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
impersonation_chain (str | collections.abc.Sequence[str] | None) –
Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

Warning

This option requires Apache Beam 2.39.0 or newer.
drain_pipeline (bool) – Optional, set to True if want to stop streaming job by draining it instead of canceling during killing task instance. See: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
cancel_timeout (int | None) – How long (in seconds) operator should wait for the pipeline to be successfully cancelled when task is being killed. (optional) default to 300s
wait_until_finished (bool | None) –
(Optional) If True, wait for the end of pipeline execution before exiting. If False, only submits job. If None, default behavior.

The default behavior depends on the type of pipeline:
- for the streaming pipeline, wait for jobs to start,
- for the batch pipeline, wait for the jobs to complete.
Warning

You cannot call PipelineResult.wait_until_finish method in your pipeline code for the operator to work properly. i. e. you must use asynchronous execution. Otherwise, your pipeline will always wait until finished. For more information, look at: Asynchronous execution

The process of starting the Dataflow job in Airflow consists of two steps: * running a subprocess and reading the stderr/stderr log for the job id. * loop waiting for the end of the job ID from the previous step by checking its status.

Step two is started just after step one has finished, so if you have wait_until_finished in your pipeline code, step two will not start until the process stops. When this process stops, steps two will run, but it will only execute one iteration as the job will be in a terminal state.

If you in your pipeline do not call the wait_for_pipeline method but pass wait_until_finish=True to the operator, the second loop will wait for the job’s terminal state.

If you in your pipeline do not call the wait_for_pipeline method, and pass wait_until_finish=False to the operator, the second loop will check once is job not in terminal state and exit the loop.
multiple_jobs (bool | None) – If pipeline creates multiple jobs then monitor all jobs. Supported only by BeamRunJavaPipelineOperator.
check_if_running (CheckJobRunning) – Before running job, validate that a previous run is not in process. Supported only by: BeamRunJavaPipelineOperator.
service_account (str | None) – Run the job as a specific service account, instead of the default GCE robot.

template_fields: collections.abc.Sequence[str] = ('job_name', 'location')[source]¶

class airflow.providers.google.cloud.operators.dataflow.DataflowTemplatedJobStartOperator(*, template, project_id=PROVIDE_PROJECT_ID, job_name='{{task.task_id}}', options=None, dataflow_default_options=None, parameters=None, location=None, gcp_conn_id='google_cloud_default', poll_sleep=10, impersonation_chain=None, environment=None, cancel_timeout=10 * 60, wait_until_finished=None, append_job_name=True, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), expected_terminal_state=None, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Start a Dataflow job with a classic template; the parameters of the operation will be passed to the job.

See also

For more information on how to use this operator, take a look at the guide: Templated jobs

Parameters

template (str) – The reference to the Dataflow template.
job_name (str) – The ‘jobName’ to use when executing the Dataflow template (templated).
options (dict[str, Any] | None) –
Map of job runtime environment options. It will update environment argument if passed.

See also

For more information on possible configurations, look at the API documentation https://cloud.google.com/dataflow/pipelines/specifying-exec-params
dataflow_default_options (dict[str, Any] | None) – Map of default job environment options.
parameters (dict[str, str] | None) – Map of job specific parameters for the template.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
location (str | None) – Job location.
gcp_conn_id (str) – The connection ID to use connecting to Google Cloud.
poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
environment (dict | None) –
Optional, Map of job runtime environment options.

See also

For more information on possible configurations, look at the API documentation https://cloud.google.com/dataflow/pipelines/specifying-exec-params
cancel_timeout (int | None) – How long (in seconds) operator should wait for the pipeline to be successfully cancelled when task is being killed.
append_job_name (bool) – True if unique suffix has to be appended to job name.
wait_until_finished (bool | None) –
(Optional) If True, wait for the end of pipeline execution before exiting. If False, only submits job. If None, default behavior.

The default behavior depends on the type of pipeline:
- for the streaming pipeline, wait for jobs to start,
- for the batch pipeline, wait for the jobs to complete.
Warning

You cannot call PipelineResult.wait_until_finish method in your pipeline code for the operator to work properly. i. e. you must use asynchronous execution. Otherwise, your pipeline will always wait until finished. For more information, look at: Asynchronous execution

The process of starting the Dataflow job in Airflow consists of two steps:
- running a subprocess and reading the stderr/stderr log for the job id.
- loop waiting for the end of the job ID from the previous step. This loop checks the status of the job.
Step two is started just after step one has finished, so if you have wait_until_finished in your pipeline code, step two will not start until the process stops. When this process stops, steps two will run, but it will only execute one iteration as the job will be in a terminal state.

If you in your pipeline do not call the wait_for_pipeline method but pass wait_until_finish=True to the operator, the second loop will wait for the job’s terminal state.

If you in your pipeline do not call the wait_for_pipeline method, and pass wait_until_finish=False to the operator, the second loop will check once is job not in terminal state and exit the loop.
expected_terminal_state (str | None) – The expected terminal state of the operator on which the corresponding Airflow task succeeds. When not specified, it will be determined by the hook.

It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and staging location.

default_args = {
    "dataflow_default_options": {
        "zone": "europe-west1-d",
        "tempLocation": "gs://my-staging-bucket/staging/",
    }
}

You need to pass the path to your dataflow template as a file reference with the template parameter. Use parameters to pass on parameters to your job. Use environment to pass on runtime environment variables to your job.

t1 = DataflowTemplatedJobStartOperator(
    task_id="dataflow_example",
    template="{{var.value.gcp_dataflow_base}}",
    parameters={
        "inputFile": "gs://bucket/input/my_input.txt",
        "outputFile": "gs://bucket/output/my_output.txt",
    },
    gcp_conn_id="airflow-conn-id",
    dag=my_dag,
)

template, dataflow_default_options, parameters, and job_name are templated, so you can use variables in them.

Note that dataflow_default_options is expected to save high-level options for project information, which apply to all dataflow operators in the DAG.

See also

https://cloud.google.com/dataflow/docs/reference/rest/v1b3 /LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/templates/executing-templates

Parameters: deferrable (bool) – Run operator in the deferrable mode.

template_fields: collections.abc.Sequence[str] = ('template', 'job_name', 'options', 'parameters', 'project_id', 'location', 'gcp_conn_id',...[source]¶

ui_color = '#0273d4'[source]¶

operator_extra_links = ()[source]¶

hook()[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event)[source]¶

Execute after trigger finishes its work.

on_kill()[source]¶

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.

class airflow.providers.google.cloud.operators.dataflow.DataflowStartFlexTemplateOperator(body, location, project_id=PROVIDE_PROJECT_ID, gcp_conn_id='google_cloud_default', drain_pipeline=False, cancel_timeout=10 * 60, wait_until_finished=None, impersonation_chain=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), append_job_name=True, expected_terminal_state=None, poll_sleep=10, *args, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Starts a Dataflow Job with a Flex Template.

See also

For more information on how to use this operator, take a look at the guide: Templated jobs

Parameters

body (dict) – The request body. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.flexTemplates/launch#request-body
location (str) – The location of the Dataflow job (for example europe-west1)
project_id (str) – The ID of the GCP project that owns the job.
gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
drain_pipeline (bool) – Optional, set to True if want to stop streaming job by draining it instead of canceling during killing task instance. See: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
cancel_timeout (int | None) – How long (in seconds) operator should wait for the pipeline to be successfully cancelled when task is being killed.
wait_until_finished (bool | None) –
(Optional) If True, wait for the end of pipeline execution before exiting. If False, only submits job. If None, default behavior.

The default behavior depends on the type of pipeline:
- for the streaming pipeline, wait for jobs to start,
- for the batch pipeline, wait for the jobs to complete.
Warning

You cannot call PipelineResult.wait_until_finish method in your pipeline code for the operator to work properly. i. e. you must use asynchronous execution. Otherwise, your pipeline will always wait until finished. For more information, look at: Asynchronous execution

The process of starting the Dataflow job in Airflow consists of two steps:
- running a subprocess and reading the stderr/stderr log for the job id.
- loop waiting for the end of the job ID from the previous step. This loop checks the status of the job.
Step two is started just after step one has finished, so if you have wait_until_finished in your pipeline code, step two will not start until the process stops. When this process stops, steps two will run, but it will only execute one iteration as the job will be in a terminal state.

If you in your pipeline do not call the wait_for_pipeline method but pass wait_until_finished=True to the operator, the second loop will wait for the job’s terminal state.

If you in your pipeline do not call the wait_for_pipeline method, and pass wait_until_finished=False to the operator, the second loop will check once is job not in terminal state and exit the loop.
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
deferrable (bool) – Run operator in the deferrable mode.
expected_terminal_state (str | None) – The expected final status of the operator on which the corresponding Airflow task succeeds. When not specified, it will be determined by the hook.
append_job_name (bool) – True if unique suffix has to be appended to job name.
poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state.

template_fields: collections.abc.Sequence[str] = ('body', 'location', 'project_id', 'gcp_conn_id')[source]¶

operator_extra_links = ()[source]¶

hook()[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event)[source]¶

Execute after trigger finishes its work.

on_kill()[source]¶

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.

class airflow.providers.google.cloud.operators.dataflow.DataflowStartSqlJobOperator(job_name, query, options, location=DEFAULT_DATAFLOW_LOCATION, project_id=PROVIDE_PROJECT_ID, gcp_conn_id='google_cloud_default', drain_pipeline=False, impersonation_chain=None, *args, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Starts Dataflow SQL query.

See also

For more information on how to use this operator, take a look at the guide: Dataflow SQL

Warning

This operator requires gcloud command (Google Cloud SDK) must be installed on the Airflow worker <https://cloud.google.com/sdk/docs/install>`__

Parameters

job_name (str) – The unique name to assign to the Cloud Dataflow job.
query (str) – The SQL query to execute.
options (dict[str, Any]) –
Job parameters to be executed. It can be a dictionary with the following keys.

For more information, look at: https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/sql/query command reference
location (str) – The location of the Dataflow job (for example europe-west1)
project_id (str) – The ID of the GCP project that owns the job. If set to None or missing, the default project_id from the GCP connection is used.
gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
drain_pipeline (bool) – Optional, set to True if want to stop streaming job by draining it instead of canceling during killing task instance. See: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

template_fields: collections.abc.Sequence[str] = ('job_name', 'query', 'options', 'location', 'project_id', 'gcp_conn_id')[source]¶

template_fields_renderers[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

on_kill()[source]¶

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.

class airflow.providers.google.cloud.operators.dataflow.DataflowStartYamlJobOperator(*, job_name, yaml_pipeline_file, region=DEFAULT_DATAFLOW_LOCATION, project_id=PROVIDE_PROJECT_ID, gcp_conn_id='google_cloud_default', append_job_name=True, drain_pipeline=False, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), poll_sleep=10, cancel_timeout=5 * 60, expected_terminal_state=None, jinja_variables=None, options=None, impersonation_chain=None, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Launch a Dataflow YAML job and return the result.

See also

For more information on how to use this operator, take a look at the guide: Dataflow YAML

Warning

This operator requires gcloud command (Google Cloud SDK) must be installed on the Airflow worker <https://cloud.google.com/sdk/docs/install>`__

Parameters

job_name (str) – Required. The unique name to assign to the Cloud Dataflow job.
yaml_pipeline_file (str) – Required. Path to a file defining the YAML pipeline to run. Must be a local file or a URL beginning with ‘gs://’.
region (str) – Optional. Region ID of the job’s regional endpoint. Defaults to ‘us-central1’.
project_id (str) – Required. The ID of the GCP project that owns the job. If set to None or missing, the default project_id from the GCP connection is used.
gcp_conn_id (str) – Optional. The connection ID used to connect to GCP.
append_job_name (bool) – Optional. Set to True if a unique suffix has to be appended to the job_name. Defaults to True.
drain_pipeline (bool) – Optional. Set to True if you want to stop a streaming pipeline job by draining it instead of canceling when killing the task instance. Note that this does not work for batch pipeline jobs or in the deferrable mode. Defaults to False. For more info see: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
deferrable (bool) – Optional. Run operator in the deferrable mode.
expected_terminal_state (str | None) – Optional. The expected terminal state of the Dataflow job at which the operator task is set to succeed. Defaults to ‘JOB_STATE_DONE’ for the batch jobs and ‘JOB_STATE_RUNNING’ for the streaming jobs.
poll_sleep (int) – Optional. The time in seconds to sleep between polling Google Cloud Platform for the Dataflow job status. Used both for the sync and deferrable mode.
cancel_timeout (int | None) – Optional. How long (in seconds) operator should wait for the pipeline to be successfully canceled when the task is being killed.
jinja_variables (dict[str, str] | None) – Optional. A dictionary of Jinja2 variables to be used in reifying the yaml pipeline file.
options (dict[str, Any] | None) – Optional. Additional gcloud or Beam job parameters. It must be a dictionary with the keys matching the optional flag names in gcloud. The list of supported flags can be found at: https://cloud.google.com/sdk/gcloud/reference/dataflow/yaml/run. Note that if a flag does not require a value, then its dictionary value must be either True or None. For example, the –log-http flag can be passed as {‘log-http’: True}.
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

Returns

Dictionary containing the job’s data.

template_fields: collections.abc.Sequence[str] = ('job_name', 'yaml_pipeline_file', 'jinja_variables', 'options', 'region', 'project_id', 'gcp_conn_id')[source]¶

template_fields_renderers[source]¶

operator_extra_links = ()[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event)[source]¶

Execute after the trigger returns an event.

on_kill()[source]¶

Cancel the dataflow job if a task instance gets killed.

This method will not be called if a task instance is killed in a deferred state.

hook()[source]¶

class airflow.providers.google.cloud.operators.dataflow.DataflowStopJobOperator(job_name_prefix=None, job_id=None, project_id=PROVIDE_PROJECT_ID, location=DEFAULT_DATAFLOW_LOCATION, gcp_conn_id='google_cloud_default', poll_sleep=10, impersonation_chain=None, stop_timeout=10 * 60, drain_pipeline=True, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Stops the job with the specified name prefix or Job ID.

All jobs with provided name prefix will be stopped. Streaming jobs are drained by default.

Parameter job_name_prefix and job_id are mutually exclusive.

See also

For more details on stopping a pipeline see: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline

See also

For more information on how to use this operator, take a look at the guide: Stopping a pipeline

Parameters

job_name_prefix (str | None) – Name prefix specifying which jobs are to be stopped.
job_id (str | None) – Job ID specifying which jobs are to be stopped.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
location (str) – Optional, Job location. If set to None or missing, “us-central1” will be used.
gcp_conn_id (str) – The connection ID to use connecting to Google Cloud.
poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status to confirm it’s stopped.
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
drain_pipeline (bool) – Optional, set to False if want to stop streaming job by canceling it instead of draining. See: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
stop_timeout (int | None) – wait time in seconds for successful job canceling/draining

template_fields = ['job_id', 'project_id', 'impersonation_chain'][source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.dataflow.DataflowCreatePipelineOperator(*, body, project_id=PROVIDE_PROJECT_ID, location=DEFAULT_DATAFLOW_LOCATION, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Creates a new Dataflow Data Pipeline instance.

See also

For more information on how to use this operator, take a look at the guide: JSON-formatted pipelines

Parameters

body (dict) – The request body (contains instance of Pipeline). See: https://cloud.google.com/dataflow/docs/reference/data-pipelines/rest/v1/projects.locations.pipelines/create#request-body
project_id (str) – The ID of the GCP project that owns the job.
location (str) – The location to direct the Data Pipelines instance to (for example us-central1).
gcp_conn_id (str) – The connection ID to connect to the Google Cloud Platform.
impersonation_chain (str | collections.abc.Sequence[str] | None) –
Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

Warning

This option requires Apache Beam 2.39.0 or newer.

Returns the created Dataflow Data Pipeline instance in JSON representation.

operator_extra_links = ()[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.dataflow.DataflowRunPipelineOperator(pipeline_name, project_id=PROVIDE_PROJECT_ID, location=DEFAULT_DATAFLOW_LOCATION, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Runs a Dataflow Data Pipeline.

See also

For more information on how to use this operator, take a look at the guide: JSON-formatted pipelines

Parameters

pipeline_name (str) – The display name of the pipeline. In example projects/PROJECT_ID/locations/LOCATION_ID/pipelines/PIPELINE_ID it would be the PIPELINE_ID.
project_id (str) – The ID of the GCP project that owns the job.
location (str) – The location to direct the Data Pipelines instance to (for example us-central1).
gcp_conn_id (str) – The connection ID to connect to the Google Cloud Platform.
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

Returns the created Job in JSON representation.

operator_extra_links = ()[source]¶

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.google.cloud.operators.dataflow.DataflowDeletePipelineOperator(pipeline_name, project_id=PROVIDE_PROJECT_ID, location=DEFAULT_DATAFLOW_LOCATION, gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶

Bases: airflow.providers.google.cloud.operators.cloud_base.GoogleCloudBaseOperator

Deletes a Dataflow Data Pipeline.

See also

For more information on how to use this operator, take a look at the guide: Deleting a pipeline

Parameters

pipeline_name (str) – The display name of the pipeline. In example projects/PROJECT_ID/locations/LOCATION_ID/pipelines/PIPELINE_ID it would be the PIPELINE_ID.
project_id (str) – The ID of the GCP project that owns the job.
location (str) – The location to direct the Data Pipelines instance to (for example us-central1).
gcp_conn_id (str) – The connection ID to connect to the Google Cloud Platform.
impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

execute(context)[source]¶

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

airflow.providers.google.cloud.operators.dataflow¶

Module Contents¶

Classes¶

`airflow.providers.google.cloud.operators.dataflow`¶