airflow.providers.google.cloud.hooks.dataflow¶
This module contains a Google Dataflow Hook.
Module Contents¶
Classes¶
| DataflowJobStatus | Helper class with Dataflow job statuses. |
| DataflowJobType | Helper class with Dataflow job types. |
| DataflowHook | Hook for Google Dataflow. |
| AsyncDataflowHook | Async hook class for dataflow service. |
Functions¶
| process_line_and_extract_dataflow_job_id_callback | Returns a callback which triggers the function passed as on_new_job_id_callback when the Dataflow job_id is found. |
Attributes¶
- airflow.providers.google.cloud.hooks.dataflow.process_line_and_extract_dataflow_job_id_callback(on_new_job_id_callback)[source]¶
- Returns a callback which triggers the function passed as on_new_job_id_callback when the Dataflow job_id is found. To be used as process_line_callback in BeamCommandRunner.
- Parameters
- on_new_job_id_callback (Callable[[str], None] | None) – Callback called when the job ID is known 
 
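A minimal usage sketch (not from the provider docs); the log line and the found_ids list are illustrative placeholders:

```python
from airflow.providers.google.cloud.hooks.dataflow import (
    process_line_and_extract_dataflow_job_id_callback,
)

found_ids = []  # hypothetical sink for discovered job IDs

# Build a line-processing callback; it scans each log line for a Dataflow
# job ID and calls the supplied callback once an ID is found.
process_line = process_line_and_extract_dataflow_job_id_callback(
    on_new_job_id_callback=found_ids.append
)

# A log line of the shape emitted by the Dataflow runner (illustrative only).
process_line("INFO: Submitted job: 2023-01-01_00_00_00-1234567890123456789")
print(found_ids)
```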
- class airflow.providers.google.cloud.hooks.dataflow.DataflowJobStatus[source]¶
- Helper class with Dataflow job statuses. Reference: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job.JobState 
- class airflow.providers.google.cloud.hooks.dataflow.DataflowJobType[source]¶
- Helper class with Dataflow job types. 
- class airflow.providers.google.cloud.hooks.dataflow.DataflowHook(gcp_conn_id='google_cloud_default', poll_sleep=10, impersonation_chain=None, drain_pipeline=False, cancel_timeout=5 * 60, wait_until_finished=None, **kwargs)[source]¶
- Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook
- Hook for Google Dataflow.
- All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.
- start_java_dataflow(job_name, variables, jar, project_id, job_class=None, append_job_name=True, multiple_jobs=False, on_new_job_id_callback=None, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Starts a Dataflow Java job.
- Parameters
- job_name (str) – The name of the job. 
- variables (dict) – Variables passed to the job. 
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- jar (str) – Name of the jar for the job.
- job_class (str | None) – Name of the java class for the job. 
- append_job_name (bool) – True if unique suffix has to be appended to job name. 
- multiple_jobs (bool) – True if the method should check for multiple jobs in Dataflow.
- on_new_job_id_callback (Callable[[str], None] | None) – Callback called when the job ID is known. 
- location (str) – Job location. 
 
 
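A minimal sketch of calling start_java_dataflow; the project ID, bucket paths, jar location, and job class below are placeholders, not values from the source docs:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook(gcp_conn_id="google_cloud_default")

# Pipeline options (variables) are passed through to the Java job.
hook.start_java_dataflow(
    job_name="example-wordcount",
    variables={
        "tempLocation": "gs://my-bucket/tmp",
        "stagingLocation": "gs://my-bucket/staging",
    },
    jar="gs://my-bucket/binaries/wordcount.jar",
    project_id="my-gcp-project",
    job_class="org.example.WordCount",
    location="europe-west1",
)
```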
 - start_template_dataflow(job_name, variables, parameters, dataflow_template, project_id, append_job_name=True, on_new_job_id_callback=None, on_new_job_callback=None, location=DEFAULT_DATAFLOW_LOCATION, environment=None)[source]¶
- Starts a Dataflow template job.
- Parameters
- job_name (str) – The name of the job. 
- variables (dict) – Map of job runtime environment options. It will update the environment argument if passed. See also: for more information on possible configurations, look at the API documentation https://cloud.google.com/dataflow/pipelines/specifying-exec-params
- parameters (dict) – Parameters for the template.
- dataflow_template (str) – GCS path to the template. 
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- append_job_name (bool) – True if unique suffix has to be appended to job name. 
- on_new_job_id_callback (Callable[[str], None] | None) – (Deprecated) Callback called when the job ID is known.
- on_new_job_callback (Callable[[dict], None] | None) – Callback called when the Job is known. 
- location (str) – Job location. See also: for more information on possible configurations, look at the API documentation https://cloud.google.com/dataflow/pipelines/specifying-exec-params
 
 
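A minimal sketch of launching a classic template; the template path points at the public Word_Count template, while the project, bucket, and region are placeholders:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()

job = hook.start_template_dataflow(
    job_name="templated-wordcount",
    # Runtime environment options for the job.
    variables={"tempLocation": "gs://my-bucket/tmp", "machineType": "n1-standard-1"},
    # Template-specific parameters.
    parameters={
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "output": "gs://my-bucket/output/wordcount",
    },
    dataflow_template="gs://dataflow-templates/latest/Word_Count",
    project_id="my-gcp-project",
    location="europe-west1",
    on_new_job_callback=lambda j: print("Dataflow job:", j.get("id")),
)
```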
 - start_flex_template(body, location, project_id, on_new_job_id_callback=None, on_new_job_callback=None)[source]¶
- Starts a flex template with the Dataflow pipeline.
- Parameters
- body (dict) – The request body. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.flexTemplates/launch#request-body 
- location (str) – The location of the Dataflow job (for example europe-west1) 
- project_id (str) – The ID of the GCP project that owns the job. If set to None or missing, the default project_id from the GCP connection is used.
- on_new_job_id_callback (Callable[[str], None] | None) – (Deprecated) A callback that is called when a Job ID is detected. 
- on_new_job_callback (Callable[[dict], None] | None) – A callback that is called when a Job is detected. 
 
- Returns
- the Job 
 
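A minimal sketch of a flex template launch; the container spec path, subscription, project, and region are placeholders for your own flex template setup:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()

# Request body as expected by the flexTemplates.launch REST method.
body = {
    "launchParameter": {
        "jobName": "flex-example",
        "containerSpecGcsPath": "gs://my-bucket/templates/my-flex-template.json",
        "parameters": {
            "inputSubscription": "projects/my-gcp-project/subscriptions/my-sub",
        },
    }
}

job = hook.start_flex_template(
    body=body,
    location="europe-west1",
    project_id="my-gcp-project",
)
print(job.get("id"), job.get("currentState"))
```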
 - start_python_dataflow(job_name, variables, dataflow, py_options, project_id, py_interpreter='python3', py_requirements=None, py_system_site_packages=False, append_job_name=True, on_new_job_id_callback=None, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Starts a Dataflow Python job.
- Parameters
- job_name (str) – The name of the job. 
- variables (dict) – Variables passed to the job. 
- dataflow (str) – Name of the Dataflow process. 
- project_id (str) – The ID of the GCP project that owns the job. If set to - Noneor missing, the default project_id from the GCP connection is used.
- py_interpreter (str) – Python version of the Beam pipeline. If None, this defaults to python3. To track Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251
- py_requirements (list[str] | None) – Additional Python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed. You could also install the apache-beam package if it is not installed on your system, or if you want to use a different version.
- py_system_site_packages (bool) – Whether to include system_site_packages in your virtualenv. See the virtualenv documentation for more information. This option is only relevant if the py_requirements parameter is not None.
- append_job_name (bool) – True if unique suffix has to be appended to job name. 
- on_new_job_id_callback (Callable[[str], None] | None) – Callback called when the job ID is known. 
- location (str) – Job location. 
 
 
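A minimal sketch of submitting a Python pipeline; the pipeline file path, bucket, project, and Beam version pin are placeholders:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()

hook.start_python_dataflow(
    job_name="python-wordcount",
    # Pipeline options passed to the Beam job.
    variables={"temp_location": "gs://my-bucket/tmp", "region": "europe-west1"},
    dataflow="/opt/pipelines/wordcount.py",
    py_options=[],
    project_id="my-gcp-project",
    py_interpreter="python3",
    # Installed into a fresh virtualenv created for this run.
    py_requirements=["apache-beam[gcp]==2.46.0"],
    py_system_site_packages=False,
    location="europe-west1",
)
```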
 - is_job_dataflow_running(name, project_id, location=DEFAULT_DATAFLOW_LOCATION, variables=None)[source]¶
- Helper method to check if a job is still running in Dataflow.
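A minimal sketch of the running-check; the job name prefix, project, and region are placeholders:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()

# True if a job whose name matches "python-wordcount" is still running.
running = hook.is_job_dataflow_running(
    name="python-wordcount",
    project_id="my-gcp-project",
    location="europe-west1",
)
print("still running:", running)
```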
 - cancel_job(project_id, job_name=None, job_id=None, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Cancels the job with the specified name prefix or Job ID.
- The job_name and job_id parameters are mutually exclusive.
- Parameters
- job_name (str | None) – Name prefix specifying which jobs are to be canceled. 
- job_id (str | None) – Job ID specifying which jobs are to be canceled. 
- location (str) – Job location. 
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
 
 
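A minimal sketch of cancelling a job by ID; the job ID, project, and region are placeholders (pass job_name instead of job_id to cancel by name prefix, as the two are mutually exclusive):

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

# cancel_timeout bounds how long the hook waits for the cancellation to take effect;
# drain_pipeline=True would drain streaming pipelines instead of cancelling them.
hook = DataflowHook(cancel_timeout=5 * 60, drain_pipeline=False)

hook.cancel_job(
    project_id="my-gcp-project",
    job_id="2023-01-01_00_00_00-1234567890123456789",
    location="europe-west1",
)
```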
 - start_sql_job(job_name, query, options, project_id, location=DEFAULT_DATAFLOW_LOCATION, on_new_job_id_callback=None, on_new_job_callback=None)[source]¶
- Starts a Dataflow SQL query.
- Parameters
- job_name (str) – The unique name to assign to the Cloud Dataflow job. 
- query (str) – The SQL query to execute. 
- options (dict[str, Any]) – Job parameters to be executed. For more information, see the gcloud command reference: https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/sql/query
- location (str) – The location of the Dataflow job (for example europe-west1) 
- project_id (str) – The ID of the GCP project that owns the job. If set to - Noneor missing, the default project_id from the GCP connection is used.
- on_new_job_id_callback (Callable[[str], None] | None) – (Deprecated) Callback called when the job ID is known. 
- on_new_job_callback (Callable[[dict], None] | None) – Callback called when the job is known. 
 
- Returns
- the new job object 
 
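A minimal sketch of a SQL job launch; the query, BigQuery destination options, project, and region are placeholders modelled on the gcloud dataflow sql query flags:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()

job = hook.start_sql_job(
    job_name="sql-example",
    query=(
        "SELECT sales_region, COUNT(*) AS cnt "
        "FROM bigquery.table.`my-gcp-project`.`reporting`.`sales` "
        "GROUP BY sales_region"
    ),
    # Options map onto flags of `gcloud dataflow sql query`.
    options={
        "bigquery-project": "my-gcp-project",
        "bigquery-dataset": "reporting",
        "bigquery-table": "sales_by_region",
    },
    project_id="my-gcp-project",
    location="europe-west1",
)
```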
 - get_job(job_id, project_id=PROVIDE_PROJECT_ID, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Gets the job with the specified Job ID.
- Parameters
- job_id (str) – Job ID to get. 
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- location (str) – The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints 
 
- Returns
- the Job 
 
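A minimal sketch of fetching a job and checking its state with the DataflowJobStatus helper; the job ID, project, and region are placeholders:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook, DataflowJobStatus

hook = DataflowHook()

job = hook.get_job(
    job_id="2023-01-01_00_00_00-1234567890123456789",
    project_id="my-gcp-project",
    location="europe-west1",
)

# The returned Job resource exposes its state in the "currentState" field.
if job["currentState"] == DataflowJobStatus.JOB_STATE_DONE:
    print("job finished successfully")
```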
 - fetch_job_metrics_by_id(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Gets the job metrics with the specified Job ID.
- Parameters
- job_id (str) – Job ID to get. 
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- location (str) – The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints 
 
- Returns
- the JobMetrics. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/JobMetrics 
 
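A minimal sketch of reading job metrics; the job ID, project, and region are placeholders, and the field names follow the JobMetrics resource linked above:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()

metrics = hook.fetch_job_metrics_by_id(
    job_id="2023-01-01_00_00_00-1234567890123456789",
    project_id="my-gcp-project",
    location="europe-west1",
)

# The JobMetrics resource contains a list of MetricUpdate entries.
for metric in metrics.get("metrics", []):
    print(metric["name"]["name"], metric.get("scalar"))
```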
 - fetch_job_messages_by_id(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Gets the job messages with the specified Job ID.
- Parameters
- job_id (str) – Job ID to get.
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
- location (str) – The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
- Returns
- the list of JobMessages. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/ListJobMessagesResponse#JobMessage 
 
 - fetch_job_autoscaling_events_by_id(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Gets the job autoscaling events with the specified Job ID.
- Parameters
- job_id (str) – Job ID to get.
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
- location (str) – The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
- Returns
- the list of AutoscalingEvents. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/ListJobMessagesResponse#autoscalingevent 
 
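A minimal sketch covering both message and autoscaling-event retrieval; the job ID, project, and region are placeholders, and the printed fields follow the JobMessage and AutoscalingEvent resources linked above:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

hook = DataflowHook()
job_id = "2023-01-01_00_00_00-1234567890123456789"  # placeholder

for message in hook.fetch_job_messages_by_id(
    job_id=job_id, project_id="my-gcp-project", location="europe-west1"
):
    print(message.get("time"), message.get("messageText"))

for event in hook.fetch_job_autoscaling_events_by_id(
    job_id=job_id, project_id="my-gcp-project", location="europe-west1"
):
    print(event.get("time"), event.get("description"))
```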
 - wait_for_done(job_name, location, project_id, job_id=None, multiple_jobs=False)[source]¶
- Waits for a Dataflow job to complete.
- Parameters
- job_name (str) – The 'jobName' to use when executing the Dataflow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' in options will be overwritten.
- location (str) – The location where the job is running.
- project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- job_id (str | None) – a Dataflow job ID 
- multiple_jobs (bool) – If the pipeline creates multiple jobs, then monitor all of them.
 
 
 
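A minimal sketch of blocking until a submitted job finishes; the job name, job ID, project, and region are placeholders:

```python
from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

# poll_sleep controls how often the job status is polled while waiting.
hook = DataflowHook(poll_sleep=10)

hook.wait_for_done(
    job_name="python-wordcount",
    location="europe-west1",
    project_id="my-gcp-project",
    job_id="2023-01-01_00_00_00-1234567890123456789",
    multiple_jobs=False,
)
```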
- class airflow.providers.google.cloud.hooks.dataflow.AsyncDataflowHook(**kwargs)[source]¶
- Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseAsyncHook
- Async hook class for the Dataflow service.
- async initialize_client(client_class)[source]¶
- Initializes an object of the given class.
- This method is used to initialize an asynchronous client. Because a large number of client classes are used for the Dataflow service, they are all initialized the same way, with credentials obtained from the corresponding method of the GoogleBaseHook class.
- Parameters
- client_class – Class of the Google Cloud SDK client to initialize.
 - async get_job(job_id, project_id=PROVIDE_PROJECT_ID, job_view=JobView.JOB_VIEW_SUMMARY, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Gets the job with the specified Job ID.
- Parameters
- job_id (str) – Job ID to get. 
- project_id (str) – the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- job_view (int) – Optional. JobView object which determines the representation of the returned data.
- location (str) – Optional. The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints 
 
 
 - async get_job_status(job_id, project_id=PROVIDE_PROJECT_ID, job_view=JobView.JOB_VIEW_SUMMARY, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
- Gets the job status with the specified Job ID.
- Parameters
- job_id (str) – Job ID to get. 
- project_id (str) – the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used. 
- job_view (int) – Optional. JobView object which determines the representation of the returned data.
- location (str) – Optional. The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints 
 
 
 
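A minimal async sketch, for instance inside a custom trigger, of polling a job's state; the job ID, project, and region are placeholders, and the state comparison assumes the JobState enum from the google.cloud.dataflow_v1beta3 client:

```python
from google.cloud.dataflow_v1beta3 import JobState

from airflow.providers.google.cloud.hooks.dataflow import AsyncDataflowHook


async def job_is_done(job_id: str) -> bool:
    hook = AsyncDataflowHook(gcp_conn_id="google_cloud_default")
    status = await hook.get_job_status(
        job_id=job_id,
        project_id="my-gcp-project",
        location="europe-west1",
    )
    return status == JobState.JOB_STATE_DONE


# Run with, e.g.: asyncio.run(job_is_done("2023-01-01_00_00_00-1234567890123456789"))
```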
