airflow.providers.google.cloud.hooks.dataflow
¶
This module contains a Google Dataflow Hook.
Module Contents¶
Classes¶
Helper class with Dataflow job statuses. |
|
Helper class with Dataflow job types. |
|
Hook for Google Dataflow. |
Functions¶
Returns callback which triggers function passed as on_new_job_id_callback when Dataflow job_id is found. |
Attributes¶
- airflow.providers.google.cloud.hooks.dataflow.process_line_and_extract_dataflow_job_id_callback(on_new_job_id_callback)[source]¶
Returns callback which triggers function passed as on_new_job_id_callback when Dataflow job_id is found. To be used for process_line_callback in
BeamCommandRunner
- Parameters
on_new_job_id_callback (Optional[Callable[[str], None]]) – Callback called when the job ID is known
- class airflow.providers.google.cloud.hooks.dataflow.DataflowJobStatus[source]¶
Helper class with Dataflow job statuses. Reference: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job.JobState
- class airflow.providers.google.cloud.hooks.dataflow.DataflowJobType[source]¶
Helper class with Dataflow job types.
- class airflow.providers.google.cloud.hooks.dataflow.DataflowHook(gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, impersonation_chain=None, drain_pipeline=False, cancel_timeout=5 * 60, wait_until_finished=None)[source]¶
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseHook
Hook for Google Dataflow.
All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.
- start_java_dataflow(job_name, variables, jar, project_id, job_class=None, append_job_name=True, multiple_jobs=False, on_new_job_id_callback=None, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Starts Dataflow java job.
- Parameters
job_name (str) – The name of the job.
variables (dict) – Variables passed to the job.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
jar (str) – Name of the jar for the job
job_class (Optional[str]) – Name of the java class for the job.
append_job_name (bool) – True if unique suffix has to be appended to job name.
multiple_jobs (bool) – True if to check for multiple job in dataflow
on_new_job_id_callback (Optional[Callable[[str], None]]) – Callback called when the job ID is known.
location (str) – Job location.
- start_template_dataflow(job_name, variables, parameters, dataflow_template, project_id, append_job_name=True, on_new_job_id_callback=None, on_new_job_callback=None, location=DEFAULT_DATAFLOW_LOCATION, environment=None)[source]¶
Starts Dataflow template job.
- Parameters
job_name (str) – The name of the job.
variables (dict) –
Map of job runtime environment options. It will update environment argument if passed.
See also
For more information on possible configurations, look at the API documentation https://cloud.google.com/dataflow/pipelines/specifying-exec-params
parameters (dict) – Parameters for the template
dataflow_template (str) – GCS path to the template.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
append_job_name (bool) – True if unique suffix has to be appended to job name.
on_new_job_id_callback (Optional[Callable[[str], None]]) – (Deprecated) Callback called when the Job is known.
on_new_job_callback (Optional[Callable[[dict], None]]) – Callback called when the Job is known.
location (str) –
Job location.
See also
For more information on possible configurations, look at the API documentation https://cloud.google.com/dataflow/pipelines/specifying-exec-params
- start_flex_template(body, location, project_id, on_new_job_id_callback=None, on_new_job_callback=None)[source]¶
Starts flex templates with the Dataflow pipeline.
- Parameters
body (dict) – The request body. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.flexTemplates/launch#request-body
location (str) – The location of the Dataflow job (for example europe-west1)
project_id (str) – The ID of the GCP project that owns the job. If set to
None
or missing, the default project_id from the GCP connection is used.on_new_job_id_callback (Optional[Callable[[str], None]]) – (Deprecated) A callback that is called when a Job ID is detected.
on_new_job_callback (Optional[Callable[[dict], None]]) – A callback that is called when a Job is detected.
- Returns
the Job
- start_python_dataflow(job_name, variables, dataflow, py_options, project_id, py_interpreter='python3', py_requirements=None, py_system_site_packages=False, append_job_name=True, on_new_job_id_callback=None, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Starts Dataflow job.
- Parameters
job_name (str) – The name of the job.
variables (dict) – Variables passed to the job.
dataflow (str) – Name of the Dataflow process.
py_options (List[str]) – Additional options.
project_id (str) – The ID of the GCP project that owns the job. If set to
None
or missing, the default project_id from the GCP connection is used.py_interpreter (str) – Python version of the beam pipeline. If None, this defaults to the python3. To track python versions supported by beam and related issues check: https://issues.apache.org/jira/browse/BEAM-1251
py_requirements (Optional[List[str]]) –
Additional python package(s) to install. If a value is passed to this parameter, a new virtual environment has been created with additional packages installed.
You could also install the apache-beam package if it is not installed on your system or you want to use a different version.
py_system_site_packages (bool) –
Whether to include system_site_packages in your virtualenv. See virtualenv documentation for more information.
This option is only relevant if the
py_requirements
parameter is not None.append_job_name (bool) – True if unique suffix has to be appended to job name.
project_id – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
on_new_job_id_callback (Optional[Callable[[str], None]]) – Callback called when the job ID is known.
location (str) – Job location.
- is_job_dataflow_running(name, project_id, location=DEFAULT_DATAFLOW_LOCATION, variables=None)[source]¶
Helper method to check if jos is still running in dataflow
- cancel_job(project_id, job_name=None, job_id=None, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Cancels the job with the specified name prefix or Job ID.
Parameter
name
andjob_id
are mutually exclusive.- Parameters
job_name (Optional[str]) – Name prefix specifying which jobs are to be canceled.
job_id (Optional[str]) – Job ID specifying which jobs are to be canceled.
location (str) – Job location.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
- start_sql_job(job_name, query, options, project_id, location=DEFAULT_DATAFLOW_LOCATION, on_new_job_id_callback=None, on_new_job_callback=None)[source]¶
Starts Dataflow SQL query.
- Parameters
job_name (str) – The unique name to assign to the Cloud Dataflow job.
query (str) – The SQL query to execute.
options (Dict[str, Any]) – Job parameters to be executed. For more information, look at: https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/sql/query command reference
location (str) – The location of the Dataflow job (for example europe-west1)
project_id (str) – The ID of the GCP project that owns the job. If set to
None
or missing, the default project_id from the GCP connection is used.on_new_job_id_callback (Optional[Callable[[str], None]]) – (Deprecated) Callback called when the job ID is known.
on_new_job_callback (Optional[Callable[[dict], None]]) – Callback called when the job is known.
- Returns
the new job object
- get_job(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Gets the job with the specified Job ID.
- Parameters
job_id (str) – Job ID to get.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
location (str) – The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
- Returns
the Job
- Return type
- fetch_job_metrics_by_id(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Gets the job metrics with the specified Job ID.
- Parameters
job_id (str) – Job ID to get.
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
location (str) – The location of the Dataflow job (for example europe-west1). See: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
- Returns
the JobMetrics. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/JobMetrics
- Return type
- fetch_job_messages_by_id(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Gets the job messages with the specified Job ID.
- Parameters
- Returns
the list of JobMessages. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/ListJobMessagesResponse#JobMessage
- Return type
List[dict]
- fetch_job_autoscaling_events_by_id(job_id, project_id, location=DEFAULT_DATAFLOW_LOCATION)[source]¶
Gets the job autoscaling events with the specified Job ID.
- Parameters
- Returns
the list of AutoscalingEvents. See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/ListJobMessagesResponse#autoscalingevent
- Return type
List[dict]
- wait_for_done(job_name, location, project_id, job_id=None, multiple_jobs=False)[source]¶
Wait for Dataflow job.
- Parameters
job_name (str) – The ‘jobName’ to use when executing the DataFlow job (templated). This ends up being set in the pipeline options, so any entry with key
'jobName'
inoptions
will be overwritten.location (str) – location the job is running
project_id (str) – Optional, the Google Cloud project ID in which to start a job. If set to None or missing, the default project_id from the Google Cloud connection is used.
job_id (Optional[str]) – a Dataflow job ID
multiple_jobs (bool) – If pipeline creates multiple jobs then monitor all jobs