airflow.providers.google.cloud.hooks.dataproc

This module contains a Google Cloud Dataproc hook.

Module Contents

class airflow.providers.google.cloud.hooks.dataproc.DataProcJobBuilder(project_id: str, task_id: str, cluster_name: str, job_type: str, properties: Optional[Dict[str, str]] = None)[source]

A helper class for building Dataproc job.

add_labels(self, labels: dict)[source]

Set labels for Dataproc job.

Parameters

labels (dict) – Labels for the job query.

add_variables(self, variables: List[str])[source]

Set variables for Dataproc job.

Parameters

variables (List[str]) – Variables for the job query.

add_args(self, args: List[str])[source]

Set args for Dataproc job.

Parameters

args (List[str]) – Args for the job query.

add_query(self, query: List[str])[source]

Set query uris for Dataproc job.

Parameters

query (List[str]) – URIs for the job queries.

add_query_uri(self, query_uri: str)[source]

Set query uri for Dataproc job.

Parameters

query_uri (str) – URI for the job query.

add_jar_file_uris(self, jars: List[str])[source]

Set jars uris for Dataproc job.

Parameters

jars (List[str]) – List of jars URIs

add_archive_uris(self, archives: List[str])[source]

Set archives uris for Dataproc job.

Parameters

archives (List[str]) – List of archives URIs

add_file_uris(self, files: List[str])[source]

Set file uris for Dataproc job.

Parameters

files (List[str]) – List of files URIs

add_python_file_uris(self, pyfiles: List[str])[source]

Set python file uris for Dataproc job.

Parameters

pyfiles (List[str]) – List of python files URIs

set_main(self, main_jar: Optional[str], main_class: Optional[str])[source]

Set Dataproc main class.

Parameters
  • main_jar (str) – URI for the main file.

  • main_class (str) – Name of the main class.

Raises

Exception

set_python_main(self, main: str)[source]

Set Dataproc main python file uri.

Parameters

main (str) – URI for the python main file.

set_job_name(self, name: str)[source]

Set Dataproc job name.

Parameters

name (str) – Job name.

build(self)[source]

Returns Dataproc job.

Returns

Dataproc job

Return type

dict

class airflow.providers.google.cloud.hooks.dataproc.DataprocHook[source]

Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook

Hook for Google Cloud Dataproc APIs.

All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.

get_cluster_client(self, location: Optional[str] = None)[source]

Returns ClusterControllerClient.

get_template_client(self, location: Optional[str] = None)[source]

Returns WorkflowTemplateServiceClient.

get_job_client(self, location: Optional[str] = None)[source]

Returns JobControllerClient.

create_cluster(self, region: str, project_id: str, cluster_name: str, cluster_config: Union[Dict, Cluster], labels: Optional[Dict[str, str]] = None, request_id: Optional[str] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Creates a cluster in a project.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project that the cluster belongs to.

  • region (str) – Required. The Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Name of the cluster to create

  • labels (Dict[str, str]) – Labels that will be assigned to created cluster

  • cluster_config (Union[Dict, google.cloud.dataproc_v1.types.ClusterConfig]) – Required. The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message ClusterConfig

  • request_id (str) – Optional. A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same id, then the second request will be ignored and the first google.longrunning.Operation created and stored in the backend is returned.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

delete_cluster(self, region: str, cluster_name: str, project_id: str, cluster_uuid: Optional[str] = None, request_id: Optional[str] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Deletes a cluster in a project.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project that the cluster belongs to.

  • region (str) – Required. The Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Required. The cluster name.

  • cluster_uuid (str) – Optional. Specifying the cluster_uuid means the RPC should fail if cluster with specified UUID does not exist.

  • request_id (str) – Optional. A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same id, then the second request will be ignored and the first google.longrunning.Operation created and stored in the backend is returned.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

diagnose_cluster(self, region: str, cluster_name: str, project_id: str, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Gets cluster diagnostic information. After the operation completes GCS uri to diagnose is returned

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project that the cluster belongs to.

  • region (str) – Required. The Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Required. The cluster name.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

get_cluster(self, region: str, cluster_name: str, project_id: str, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Gets the resource representation for a cluster in a project.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project that the cluster belongs to.

  • region (str) – Required. The Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Required. The cluster name.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

list_clusters(self, region: str, filter_: str, project_id: str, page_size: Optional[int] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Lists all regions/{region}/clusters in a project.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project that the cluster belongs to.

  • region (str) – Required. The Cloud Dataproc region in which to handle the request.

  • filter (str) – Optional. A filter constraining the clusters to list. Filters are case-sensitive.

  • page_size (int) – The maximum number of resources contained in the underlying API response. If page streaming is performed per- resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

update_cluster(self, location: str, cluster_name: str, cluster: Union[Dict, Cluster], update_mask: Union[Dict, FieldMask], project_id: str, graceful_decommission_timeout: Optional[Union[Dict, Duration]] = None, request_id: Optional[str] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Updates a cluster in a project.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • cluster_name (str) – Required. The cluster name.

  • cluster (Union[Dict, google.cloud.dataproc_v1.types.Cluster]) –

    Required. The changes to the cluster.

    If a dict is provided, it must be of the same form as the protobuf message Cluster

  • update_mask (Union[Dict, google.cloud.dataproc_v1.types.FieldMask]) –

    Required. Specifies the path, relative to Cluster, of the field to update. For example, to change the number of workers in a cluster to 5, the update_mask parameter would be specified as config.worker_config.num_instances, and the PATCH request body would specify the new value, as follows:

    { "config":{ "workerConfig":{ "numInstances":"5" } } }
    

    Similarly, to change the number of preemptible workers in a cluster to 5, the update_mask parameter would be config.secondary_worker_config.num_instances, and the PATCH request body would be set as follows:

    { "config":{ "secondaryWorkerConfig":{ "numInstances":"5" } } }
    

    If a dict is provided, it must be of the same form as the protobuf message FieldMask

  • graceful_decommission_timeout (Union[Dict, google.cloud.dataproc_v1.types.Duration]) –

    Optional. Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is 1 day.

    Only supported on Dataproc image versions 1.2 and higher.

    If a dict is provided, it must be of the same form as the protobuf message Duration

  • request_id (str) – Optional. A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same id, then the second request will be ignored and the first google.longrunning.Operation created and stored in the backend is returned.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

create_workflow_template(self, location: str, template: Union[Dict, WorkflowTemplate], project_id: str, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Creates new workflow template.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • template (Union[dict, WorkflowTemplate]) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

instantiate_workflow_template(self, location: str, template_name: str, project_id: str, version: Optional[int] = None, request_id: Optional[str] = None, parameters: Optional[Dict[str, str]] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Instantiates a template and begins execution.

Parameters
  • template_name (str) – Name of template to instantiate.

  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • version (int) – Optional. The version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template.

  • request_id (str) – Optional. A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • parameters (Dict[str, str]) – Optional. Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

instantiate_inline_workflow_template(self, location: str, template: Union[Dict, WorkflowTemplate], project_id: str, request_id: Optional[str] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Instantiates a template and begins execution.

Parameters
  • template (Union[Dict, WorkflowTemplate]) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate

  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • request_id (str) – Optional. A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

wait_for_job(self, job_id: str, location: str, project_id: str, wait_time: int = 10, timeout: Optional[int] = None)[source]

Helper method which polls a job to check if it finishes.

Parameters
  • job_id (str) – Id of the Dataproc job

  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • wait_time (int) – Number of seconds between checks

  • timeout (int) – How many seconds wait for job to be ready. Used only if asynchronous is False

get_job(self, location: str, job_id: str, project_id: str, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Gets the resource representation for a job in a project.

Parameters
  • job_id (str) – Id of the Dataproc job

  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

submit_job(self, location: str, job: Union[dict, Job], project_id: str, request_id: Optional[str] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Submits a job to a cluster.

Parameters
  • job (Union[Dict, Job]) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job

  • project_id (str) – Required. The ID of the Google Cloud project the cluster belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • request_id (str) – Optional. A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

submit(self, project_id: str, job: dict, region: str = 'global', job_error_states: Optional[Iterable[str]] = None)[source]

Submits Google Cloud Dataproc job.

Parameters
  • project_id (str) – The id of Google Cloud Dataproc project.

  • job (dict) – The job to be submitted

  • region (str) – The region of Google Dataproc cluster.

  • job_error_states (List[str]) – Job states that should be considered error states.

cancel_job(self, job_id: str, project_id: str, location: Optional[str] = None, retry: Optional[Retry] = None, timeout: Optional[float] = None, metadata: Optional[Sequence[Tuple[str, str]]] = None)[source]

Starts a job cancellation request.

Parameters
  • project_id (str) – Required. The ID of the Google Cloud project that the job belongs to.

  • location (str) – Required. The Cloud Dataproc region in which to handle the request.

  • job_id (str) – Required. The job ID.

  • retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.

  • timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.

  • metadata (Sequence[Tuple[str, str]]) – Additional metadata that is provided to the method.

Was this entry helpful?