airflow.providers.google.cloud.hooks.dataproc¶
This module contains a Google Cloud Dataproc hook.
Module Contents¶
Classes¶
| DataProcJobBuilder | A helper class for building a Dataproc job. |
| DataprocHook | Google Cloud Dataproc APIs. |
| DataprocAsyncHook | Asynchronous interaction with Google Cloud Dataproc APIs. |
- class airflow.providers.google.cloud.hooks.dataproc.DataProcJobBuilder(project_id, task_id, cluster_name, job_type, properties=None)[source]¶
- A helper class for building a Dataproc job.
 - add_labels(labels=None)[source]¶
- Set labels for Dataproc job. - Parameters
- labels (dict | None) – Labels for the job query. 
 
 - add_variables(variables=None)[source]¶
- Set variables for Dataproc job. - Parameters
- variables (dict | None) – Variables for the job query. 
 
 - add_query_uri(query_uri)[source]¶
- Set query uri for Dataproc job. - Parameters
- query_uri (str) – URI for the job query. 
 
 - set_python_main(main)[source]¶
- Set Dataproc main python file uri. - Parameters
- main (str) – URI for the python main file. 
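As a rough illustration of how the builder methods above might be combined, the sketch below applies labels and a main Python file to any object exposing `add_labels` and `set_python_main`. The helper `configure_pyspark_job`, the `RecordingBuilder` stand-in, and the bucket path are hypothetical names introduced for this example; in real use the object passed in would presumably be a `DataProcJobBuilder` created for a PySpark job.

```python
from __future__ import annotations

from typing import Any


def configure_pyspark_job(builder: Any, main_uri: str, labels: dict[str, str]) -> None:
    """Hypothetical helper: set labels and the main Python file on a builder."""
    builder.add_labels(labels)
    builder.set_python_main(main_uri)


class RecordingBuilder:
    """Stand-in for this sketch only; it records the calls it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, Any]] = []

    def add_labels(self, labels=None):
        self.calls.append(("add_labels", labels))

    def set_python_main(self, main):
        self.calls.append(("set_python_main", main))


recorder = RecordingBuilder()
configure_pyspark_job(recorder, "gs://example-bucket/main.py", {"team": "data"})
```

Passing the builder in, rather than constructing it inside the helper, keeps the sketch free of any Airflow import.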
 
 
- class airflow.providers.google.cloud.hooks.dataproc.DataprocHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶
- Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook
- Google Cloud Dataproc APIs.
- All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.
 - wait_for_operation(operation, timeout=None, result_retry=DEFAULT)[source]¶
- Wait for a long-lasting operation to complete. 
 - create_cluster(region, project_id, cluster_name, cluster_config=None, virtual_cluster_config=None, labels=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Create a cluster in a specified project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region in which to handle the request. 
- cluster_name (str) – Name of the cluster to create. 
- labels (dict[str, str] | None) – Labels that will be assigned to created cluster. 
- cluster_config (dict | google.cloud.dataproc_v1.Cluster | None) – The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message ClusterConfig.
- virtual_cluster_config (dict | None) – The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster with VirtualClusterConfig.
- request_id (str | None) – A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
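The parameters above can be put together roughly as follows. This is a minimal sketch, not the provider's own usage example: `build_cluster_config`, `create_and_wait`, and `FakeHook` are hypothetical names, the machine type and label values are placeholders, and the hook is passed in as an argument so the code runs without Airflow installed. The dict keys follow the snake_case field names of the `ClusterConfig` protobuf message.

```python
from __future__ import annotations


def build_cluster_config(num_workers: int, machine_type: str = "n1-standard-4") -> dict:
    """Return a dict shaped like the ClusterConfig protobuf message."""
    return {
        "master_config": {"num_instances": 1, "machine_type_uri": machine_type},
        "worker_config": {"num_instances": num_workers, "machine_type_uri": machine_type},
    }


def create_and_wait(hook, *, project_id: str, region: str, cluster_name: str,
                    timeout: float | None = None):
    """Create a cluster, then block until the long-running operation finishes.

    `hook` is expected to behave like DataprocHook; all arguments are passed
    by keyword, as the hook requires for methods that use project_id.
    """
    operation = hook.create_cluster(
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        cluster_config=build_cluster_config(num_workers=2),
        labels={"env": "dev"},
    )
    return hook.wait_for_operation(operation=operation, timeout=timeout)


class FakeHook:
    """Stand-in so the sketch runs without Airflow or credentials."""

    def __init__(self):
        self.create_kwargs = None

    def create_cluster(self, **kwargs):
        self.create_kwargs = kwargs
        return "operation"

    def wait_for_operation(self, operation, timeout=None):
        return {"operation": operation, "timeout": timeout}


fake = FakeHook()
result = create_and_wait(
    fake, project_id="my-project", region="europe-west1", cluster_name="demo", timeout=600
)
```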
 
 
 - delete_cluster(region, cluster_name, project_id, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Delete a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region in which to handle the request. 
- cluster_name (str) – Name of the cluster to delete. 
- cluster_uuid (str | None) – If specified, the RPC should fail if a cluster with the specified UUID does not exist. 
- request_id (str | None) – A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - diagnose_cluster(region, cluster_name, project_id, tarball_gcs_dir=None, diagnosis_interval=None, jobs=None, yarn_application_ids=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get cluster diagnostic information. - After the operation completes, the response contains the Cloud Storage URI of the diagnostic output report containing a summary of collected diagnostics. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region in which to handle the request. 
- cluster_name (str) – Name of the cluster. 
- tarball_gcs_dir (str | None) – The output Cloud Storage directory for the diagnostic tarball. If not specified, a task-specific directory in the cluster’s staging bucket will be used. 
- diagnosis_interval (dict | google.type.interval_pb2.Interval | None) – Time interval in which diagnosis should be carried out on the cluster. 
- jobs (collections.abc.MutableSequence[str] | None) – Specifies a list of jobs on which diagnosis is to be performed. Format: projects/{project}/regions/{region}/jobs/{job} 
- yarn_application_ids (collections.abc.MutableSequence[str] | None) – Specifies a list of yarn applications on which diagnosis is to be performed. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
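The `jobs` parameter expects fully qualified job names in the format given above. A small helper such as the following (the name `job_resource_name` and the project, region, and job IDs are placeholders chosen for this sketch) could build that list from plain job IDs:

```python
def job_resource_name(project: str, region: str, job: str) -> str:
    """Format a job ID as projects/{project}/regions/{region}/jobs/{job}."""
    return f"projects/{project}/regions/{region}/jobs/{job}"


jobs = [
    job_resource_name("my-project", "europe-west1", job_id)
    for job_id in ("job-a", "job-b")
]
```

The resulting list could then be passed as `jobs=jobs` in a keyword-only call to `diagnose_cluster`.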
 
 
 - get_cluster(region, cluster_name, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get the resource representation for a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- cluster_name (str) – The cluster name. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - list_clusters(region, filter_, project_id, page_size=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- List all regions/{region}/clusters in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- filter_ – Filter used to constrain the clusters that are listed. Case-sensitive. 
- page_size (int | None) – The maximum number of resources contained in the underlying API response. If page streaming is performed per-resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
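A filter string for `filter_` could be assembled as sketched below. The clause syntax used here (`status.state = ...` and `labels.KEY = ...` joined with `AND`) is an assumption based on the Dataproc ListClusters API rather than something this page specifies, and `build_cluster_filter` is a hypothetical helper name.

```python
from __future__ import annotations


def build_cluster_filter(state: str | None = None,
                         labels: dict[str, str] | None = None) -> str:
    """Join optional state and label constraints into one filter string."""
    clauses = []
    if state:
        clauses.append(f"status.state = {state}")
    for key, value in (labels or {}).items():
        clauses.append(f"labels.{key} = {value}")
    return " AND ".join(clauses)


cluster_filter = build_cluster_filter("ACTIVE", {"env": "staging"})
```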
 
 
 - update_cluster(cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Update a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- cluster_name (str) – The cluster name. 
- cluster (dict | google.cloud.dataproc_v1.Cluster) – Changes to the cluster. If a dict is provided, it must be of the same form as the protobuf message Cluster.
- update_mask (dict | google.protobuf.field_mask_pb2.FieldMask) – Specifies the path, relative to Cluster, of the field to update. For example, to change the number of workers in a cluster to 5, this would be specified as config.worker_config.num_instances, and the PATCH request body would specify the new value: {"config": {"workerConfig": {"numInstances": "5"}}}. Similarly, to change the number of preemptible workers in a cluster to 5, this would be config.secondary_worker_config.num_instances and the PATCH request body would be: {"config": {"secondaryWorkerConfig": {"numInstances": "5"}}}. If a dict is provided, it must be of the same form as the protobuf message FieldMask.
- graceful_decommission_timeout (dict | google.protobuf.duration_pb2.Duration | None) – Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is one day. Only supported on Dataproc image versions 1.2 and higher. If a dict is provided, it must be of the same form as the protobuf message Duration.
- request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
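To make the relationship between `cluster`, `update_mask`, and `graceful_decommission_timeout` concrete, the sketch below builds the three arguments for resizing the primary worker group. `scale_workers_request` is a hypothetical helper; the dicts use the snake_case field names of the `Cluster`, `FieldMask` (`paths`), and `Duration` (`seconds`) protobuf messages, whereas the PATCH bodies quoted above show the camelCase REST form.

```python
def scale_workers_request(num_workers: int, graceful_seconds: int = 0) -> dict:
    """Build keyword arguments for update_cluster that resize the worker group."""
    return {
        # Only the field named in update_mask is changed on the cluster.
        "cluster": {"config": {"worker_config": {"num_instances": num_workers}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
        # 0 seconds means forceful decommissioning, per the description above.
        "graceful_decommission_timeout": {"seconds": graceful_seconds},
    }


request = scale_workers_request(5, graceful_seconds=600)
```

The result could be unpacked into a keyword-only call, for example `hook.update_cluster(project_id=..., region=..., cluster_name=..., **request)`.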
 
 
 - start_cluster(region, project_id, cluster_name, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Start a cluster in a project. - Parameters
- region (str) – Cloud Dataproc region to handle the request. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- cluster_name (str) – The cluster name. 
- cluster_uuid (str | None) – The cluster UUID. 
- request_id (str | None) – A unique id used to identify the request. If the server receives two StartClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
- Returns
- An instance of google.api_core.operation.Operation.
- Return type
- google.api_core.operation.Operation
 
 - stop_cluster(region, project_id, cluster_name, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Stop a cluster in a project. - Parameters
- region (str) – Cloud Dataproc region to handle the request. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- cluster_name (str) – The cluster name. 
- cluster_uuid (str | None) – The cluster UUID. 
- request_id (str | None) – A unique id used to identify the request. If the server receives two StopClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
- Returns
- An instance of google.api_core.operation.Operation.
- Return type
- google.api_core.operation.Operation
 
 - create_workflow_template(template, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Create a new workflow template. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - instantiate_workflow_template(template_name, project_id, region, version=None, request_id=None, parameters=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Instantiate a template and begin execution. - Parameters
- template_name (str) – Name of template to instantiate. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- version (int | None) – Version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template. 
- request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. 
- parameters (dict[str, str] | None) – Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
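Because parameter values may not exceed 100 characters, it can be convenient to check them before calling `instantiate_workflow_template`. The following is a minimal client-side sketch; `oversized_parameters` and the sample parameter names are hypothetical, and the service may enforce further rules that are not covered here.

```python
from __future__ import annotations

MAX_PARAMETER_VALUE_LENGTH = 100  # limit stated in the parameter description


def oversized_parameters(parameters: dict[str, str]) -> list[str]:
    """Return the names of parameters whose values exceed the length limit."""
    return sorted(
        name
        for name, value in parameters.items()
        if len(value) > MAX_PARAMETER_VALUE_LENGTH
    )


params = {"INPUT_PATH": "gs://example-bucket/input", "NOTE": "x" * 101}
rejected = oversized_parameters(params)
```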
 
 
 - instantiate_inline_workflow_template(template, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Instantiate a template and begin execution. - Parameters
- template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - wait_for_job(job_id, project_id, region, wait_time=10, timeout=None)[source]¶
- Poll a job to check if it has finished. 
 - get_job(job_id, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get the resource representation for a job in a project. - Parameters
- job_id (str) – Dataproc job ID. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - submit_job(job, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Submit a job to a cluster. - Parameters
- job (dict | google.cloud.dataproc_v1.Job) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- request_id (str | None) – A unique id used to identify the request. If the server receives two SubmitJobRequest requests with the same ID, the second request will be ignored, and the first job created and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - cancel_job(job_id, project_id, region=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Start a job cancellation request. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str | None) – Cloud Dataproc region to handle the request. 
- job_id (str) – The job ID. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - create_batch(region, project_id, batch, batch_id=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Create a batch workload. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- batch (dict | google.cloud.dataproc_v1.Batch) – The batch to create. 
- batch_id (str | None) – The ID to use for the batch, which will become the final component of the batch’s resource name. This value must be 4-63 characters long. Valid characters are [a-z][0-9]-.
- request_id (str | None) – A unique id used to identify the request. If the server receives two CreateBatchRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
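The `batch_id` constraints above (4-63 characters drawn from lowercase letters, digits, and hyphens) can be checked locally before a request is sent. This is a sketch of that check only; `is_valid_batch_id` is a hypothetical helper, and the service may apply stricter rules than the ones stated here.

```python
import re

# 4 to 63 characters, each a lowercase letter, a digit, or a hyphen.
_BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the documented length and character rules."""
    return _BATCH_ID_PATTERN.fullmatch(batch_id) is not None
```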
 
 
 - delete_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Delete the batch workload resource. - Parameters
- batch_id (str) – The batch ID. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - get_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get the batch workload resource representation. - Parameters
- batch_id (str) – The batch ID. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - list_batches(region, project_id, page_size=None, page_token=None, retry=DEFAULT, timeout=None, metadata=(), filter=None, order_by=None)[source]¶
- List batch workloads. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- page_size (int | None) – The maximum number of batches to return in each response. The service may return fewer than this value. The default page size is 20; the maximum page size is 1000. 
- page_token (str | None) – A page token received from a previous ListBatches call. Provide this token to retrieve the subsequent page.
- retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
- filter (str | None) – Result filters as specified in ListBatchesRequest 
- order_by (str | None) – How to order results as specified in ListBatchesRequest 
 
 
 - wait_for_batch(batch_id, region, project_id, wait_check_interval=10, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Wait for a batch job to complete. - After submission of a batch job, the operator waits for the job to complete. This hook method is also useful when Airflow is restarted or the task process is killed for any reason. In that case, the creation is attempted again, the raised AlreadyExists is caught, and execution falls back to this function to wait for completion. - Parameters
- batch_id (str) – The batch ID. 
- region (str) – Cloud Dataproc region to handle the request. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- wait_check_interval (int) – The amount of time to pause between checks for job completion. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
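The recovery flow described for `wait_for_batch` can be sketched as follows. This is an illustration under stated assumptions, not the operator's actual code: `create_or_attach_batch`, `FakeAlreadyExists`, and `FakeHook` are hypothetical names, and the exception type is passed in so the example has no dependency on `google.api_core` (in real use it would presumably be `google.api_core.exceptions.AlreadyExists`).

```python
def create_or_attach_batch(hook, *, project_id, region, batch, batch_id, already_exists_exc):
    """Create a batch, or wait on the existing one if it was already created."""
    try:
        operation = hook.create_batch(
            project_id=project_id, region=region, batch=batch, batch_id=batch_id
        )
    except already_exists_exc:
        # A previous attempt already submitted this batch: wait for it instead.
        return hook.wait_for_batch(batch_id=batch_id, region=region, project_id=project_id)
    return hook.wait_for_operation(operation=operation)


class FakeAlreadyExists(Exception):
    """Stand-in for the AlreadyExists error raised by the API client."""


class FakeHook:
    """Stand-in hook; `exists` controls whether create_batch raises."""

    def __init__(self, exists):
        self.exists = exists

    def create_batch(self, **kwargs):
        if self.exists:
            raise FakeAlreadyExists(kwargs["batch_id"])
        return "operation"

    def wait_for_operation(self, operation):
        return f"finished:{operation}"

    def wait_for_batch(self, **kwargs):
        return f"attached:{kwargs['batch_id']}"


common = dict(project_id="my-project", region="europe-west1", batch={}, batch_id="b-1234",
              already_exists_exc=FakeAlreadyExists)
first_run = create_or_attach_batch(FakeHook(exists=False), **common)
after_restart = create_or_attach_batch(FakeHook(exists=True), **common)
```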
 
 
 
- class airflow.providers.google.cloud.hooks.dataproc.DataprocAsyncHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶
- Bases: airflow.providers.google.common.hooks.base_google.GoogleBaseHook
- Asynchronous interaction with Google Cloud Dataproc APIs.
- All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.
 - async create_cluster(region, project_id, cluster_name, cluster_config=None, virtual_cluster_config=None, labels=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Create a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region in which to handle the request. 
- cluster_name (str) – Name of the cluster to create. 
- labels (dict[str, str] | None) – Labels that will be assigned to created cluster. 
- cluster_config (dict | google.cloud.dataproc_v1.Cluster | None) – The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message ClusterConfig.
- virtual_cluster_config (dict | None) – The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster with VirtualClusterConfig.
- request_id (str | None) – A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async delete_cluster(region, cluster_name, project_id, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Delete a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region in which to handle the request. 
- cluster_name (str) – Name of the cluster to delete. 
- cluster_uuid (str | None) – If specified, the RPC should fail if a cluster with the specified UUID does not exist. 
- request_id (str | None) – A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async diagnose_cluster(region, cluster_name, project_id, tarball_gcs_dir=None, diagnosis_interval=None, jobs=None, yarn_application_ids=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get cluster diagnostic information. - After the operation completes, the response contains the Cloud Storage URI of the diagnostic output report containing a summary of collected diagnostics. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region in which to handle the request. 
- cluster_name (str) – Name of the cluster. 
- tarball_gcs_dir (str | None) – The output Cloud Storage directory for the diagnostic tarball. If not specified, a task-specific directory in the cluster’s staging bucket will be used. 
- diagnosis_interval (dict | google.type.interval_pb2.Interval | None) – Time interval in which diagnosis should be carried out on the cluster. 
- jobs (collections.abc.MutableSequence[str] | None) – Specifies a list of jobs on which diagnosis is to be performed. Format: projects/{project}/regions/{region}/jobs/{job} 
- yarn_application_ids (collections.abc.MutableSequence[str] | None) – Specifies a list of yarn applications on which diagnosis is to be performed. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async get_cluster(region, cluster_name, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get the resource representation for a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- cluster_name (str) – The cluster name. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
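Since the methods of the asynchronous hook are coroutines, they have to be awaited. The sketch below shows the general shape of such a call with keyword arguments; `fetch_cluster_state` and `FakeAsyncHook` are hypothetical names, and the stand-in returns a plain string for the state where the real `Cluster` message would carry an enum value in `status.state`.

```python
import asyncio
from types import SimpleNamespace


async def fetch_cluster_state(hook, *, project_id, region, cluster_name):
    """Await get_cluster on an async hook and return the cluster's state."""
    cluster = await hook.get_cluster(
        project_id=project_id, region=region, cluster_name=cluster_name
    )
    return cluster.status.state


class FakeAsyncHook:
    """Stand-in so the sketch runs without Airflow or credentials."""

    async def get_cluster(self, **kwargs):
        return SimpleNamespace(status=SimpleNamespace(state="RUNNING"))


state = asyncio.run(
    fetch_cluster_state(
        FakeAsyncHook(), project_id="my-project", region="europe-west1", cluster_name="demo"
    )
)
```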
 
 
 - async list_clusters(region, filter_, project_id, page_size=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- List all regions/{region}/clusters in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- filter_ – Filter used to constrain the clusters that are listed. Case-sensitive. 
- page_size (int | None) – The maximum number of resources contained in the underlying API response. If page streaming is performed per-resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async update_cluster(cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Update a cluster in a project. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- cluster_name (str) – The cluster name. 
- cluster (dict | google.cloud.dataproc_v1.Cluster) – Changes to the cluster. If a dict is provided, it must be of the same form as the protobuf message Cluster.
- update_mask (dict | google.protobuf.field_mask_pb2.FieldMask) – Specifies the path, relative to Cluster, of the field to update. For example, to change the number of workers in a cluster to 5, this would be specified as config.worker_config.num_instances, and the PATCH request body would specify the new value: {"config": {"workerConfig": {"numInstances": "5"}}}. Similarly, to change the number of preemptible workers in a cluster to 5, this would be config.secondary_worker_config.num_instances and the PATCH request body would be: {"config": {"secondaryWorkerConfig": {"numInstances": "5"}}}. If a dict is provided, it must be of the same form as the protobuf message FieldMask.
- graceful_decommission_timeout (dict | google.protobuf.duration_pb2.Duration | None) – Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is one day. Only supported on Dataproc image versions 1.2 and higher. If a dict is provided, it must be of the same form as the protobuf message Duration.
- request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
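
As a rough sketch of how the dict forms of cluster, update_mask and graceful_decommission_timeout fit together, the helper below builds keyword arguments for resizing the primary worker group. The helper name build_resize_kwargs and the example values are hypothetical and not part of the hook; the dict keys follow the Cluster, FieldMask (paths) and Duration (seconds) protobuf messages.

```python
ONE_DAY_SECONDS = 24 * 60 * 60  # maximum graceful decommission timeout stated above


def build_resize_kwargs(project_id, region, cluster_name, num_workers, grace_seconds=0):
    """Build update_cluster keyword arguments for a worker resize (hypothetical helper)."""
    if num_workers < 0:
        raise ValueError("num_workers must not be negative")
    if not 0 <= grace_seconds <= ONE_DAY_SECONDS:
        raise ValueError("grace_seconds must be between 0 and one day")
    kwargs = {
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
        # Dict form of Cluster: only the field named in update_mask is changed.
        "cluster": {"config": {"worker_config": {"num_instances": num_workers}}},
        # Dict form of FieldMask: a list of paths relative to Cluster.
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
    if grace_seconds:
        # Dict form of Duration.
        kwargs["graceful_decommission_timeout"] = {"seconds": grace_seconds}
    return kwargs
```

Inside a coroutine, the result would typically be passed on as await hook.update_cluster(**kwargs), where hook is an instance of the async hook. Keyword arguments are used throughout because the parameter order differs between methods.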
 
 
 - async create_workflow_template(template, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Create a new workflow template. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
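
For orientation, a template passed as a dict might look like the one built below. This is a minimal sketch only: the helper name build_workflow_template, the step ID and the example values are hypothetical, and the keys (id, placement, cluster_selector, cluster_labels, jobs, step_id, pyspark_job) are taken from the WorkflowTemplate protobuf message rather than from this hook.

```python
def build_workflow_template(template_id, cluster_labels, main_python_file_uri):
    """Build a one-step workflow template dict (hypothetical helper)."""
    return {
        "id": template_id,
        # Run on an existing cluster selected by label instead of a managed cluster.
        "placement": {"cluster_selector": {"cluster_labels": dict(cluster_labels)}},
        "jobs": [
            {
                "step_id": "main-step",  # illustrative step name
                "pyspark_job": {"main_python_file_uri": main_python_file_uri},
            },
        ],
    }
```

The resulting dict would be passed as the template argument, together with project_id and region.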
 
 
 - async instantiate_workflow_template(template_name, project_id, region, version=None, request_id=None, parameters=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Instantiate a template and begin execution. - Parameters
- template_name (str) – Name of template to instantiate. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- version (int | None) – Version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template. 
- request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. 
- parameters (dict[str, str] | None) – Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
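
Because parameter values may not exceed 100 characters, it can be convenient to check the parameters map before making the call. The following is a minimal sketch; the helper name find_oversized_parameters is hypothetical, and the service remains the authority on what it accepts.

```python
MAX_PARAMETER_VALUE_LENGTH = 100  # limit stated in the parameters description above


def find_oversized_parameters(parameters):
    """Return the sorted names of parameters whose values are too long (hypothetical helper)."""
    return sorted(
        name
        for name, value in parameters.items()
        if len(value) > MAX_PARAMETER_VALUE_LENGTH
    )
```

An empty list means every value is within the limit, so the map can be passed as the parameters argument.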
 
 
 - async instantiate_inline_workflow_template(template, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Instantiate a template and begin execution. - Parameters
- template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async get_job(job_id, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get the resource representation for a job in a project. - Parameters
- job_id (str) – Dataproc job ID. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
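
A common use of get_job is to poll until the job reaches a final state. The sketch below assumes the returned object exposes status.state and that DONE, ERROR and CANCELLED are the final states, as in the Dataproc JobStatus message; the function name wait_for_job, the polling interval and the poll limit are hypothetical choices.

```python
import asyncio

# Assumed final states, following the Dataproc JobStatus.State enum.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


async def wait_for_job(get_job, job_id, project_id, region, poll_interval=10.0, max_polls=60):
    """Poll get_job until a final state is seen and return its name (hypothetical helper)."""
    for _ in range(max_polls):
        job = await get_job(job_id=job_id, project_id=project_id, region=region)
        state = job.status.state
        # Enum members expose .name; plain strings are used as they are.
        state_name = getattr(state, "name", str(state))
        if state_name in TERMINAL_STATES:
            return state_name
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

With a hook instance this would be called as await wait_for_job(hook.get_job, job_id=..., project_id=..., region=...). Passing the method in as a callable keeps the loop easy to test with a stub.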
 
 
 - async submit_job(job, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Submit a job to a cluster. - Parameters
- job (dict | google.cloud.dataproc_v1.Job) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
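
To illustrate the dict form of job, the sketch below builds a PySpark job and a fresh request ID. The helper names build_pyspark_job and new_request_id are hypothetical; the keys (reference, placement, cluster_name, pyspark_job, main_python_file_uri, args) come from the Job protobuf message, and using a UUID for request_id is an assumption rather than a requirement stated here.

```python
import uuid


def build_pyspark_job(project_id, cluster_name, main_python_file_uri, args=None):
    """Build a PySpark job dict in the shape of the Job message (hypothetical helper)."""
    job = {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    if args:
        job["pyspark_job"]["args"] = list(args)
    return job


def new_request_id():
    """Return a random ID; reuse the same ID on retries so the request stays idempotent."""
    return str(uuid.uuid4())
```

The job dict and the request ID would then be passed as the job and request_id arguments, together with project_id and region.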
 
 
 - async cancel_job(job_id, project_id, region=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Start a job cancellation request. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str | None) – Cloud Dataproc region to handle the request. 
- job_id (str) – The job ID. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async create_batch(region, project_id, batch, batch_id=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Create a batch workload. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- batch (dict | google.cloud.dataproc_v1.Batch) – The batch to create. 
- batch_id (str | None) – The ID to use for the batch, which will become the final component of the batch’s resource name. This value must be 4-63 characters long. Valid characters are [a-z][0-9]-. 
- request_id (str | None) – A unique id used to identify the request. If the server receives two CreateBatchRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
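
The batch_id constraints above (4-63 characters drawn from lowercase letters, digits and hyphens) can be checked locally before the request is sent. This is a minimal sketch with a hypothetical helper name, is_valid_batch_id; the service may apply further rules that are not captured here.

```python
import re

# 4-63 characters, each a lowercase letter, a digit, or a hyphen.
_BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id):
    """Return True if batch_id matches the documented length and character set."""
    return _BATCH_ID_PATTERN.fullmatch(batch_id) is not None
```

For example, is_valid_batch_id("nightly-etl-2024") is True, while an ID containing uppercase letters or fewer than four characters is rejected.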
 
 
 - async delete_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Delete the batch workload resource. - Parameters
- batch_id (str) – The batch ID. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async get_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
- Get the batch workload resource representation. - Parameters
- batch_id (str) – The batch ID. 
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
 
 
 - async list_batches(region, project_id, page_size=None, page_token=None, retry=DEFAULT, timeout=None, metadata=(), filter=None, order_by=None)[source]¶
- List batch workloads. - Parameters
- project_id (str) – Google Cloud project ID that the cluster belongs to. 
- region (str) – Cloud Dataproc region to handle the request. 
- page_size (int | None) – The maximum number of batches to return in each response. The service may return fewer than this value. The default page size is 20; the maximum page size is 1000. 
- page_token (str | None) – A page token received from a previous ListBatches call. Provide this token to retrieve the subsequent page. 
- retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried. 
- timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt. 
- metadata (Sequence[tuple[str, str]]) – Additional metadata that is provided to the method. 
- filter (str | None) – Result filters as specified in ListBatchesRequest. 
- order_by (str | None) – How to order results as specified in ListBatchesRequest. 
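
The page_size and page_token parameters follow the usual token-based paging pattern: each response carries a token for the next page, and an empty token marks the end. The sketch below shows that pattern in isolation. The fetch_page callable is hypothetical, standing in for whatever returns one page of items and the next token; it is not the return type of list_batches, and the object returned by the hook may already iterate across pages for you.

```python
import asyncio


async def collect_all_pages(fetch_page, page_size=20):
    """Follow page tokens until the last page and return every item (hypothetical helper).

    fetch_page is an assumed async callable taking page_size and page_token and
    returning a tuple of (items, next_page_token).
    """
    items = []
    token = None
    while True:
        page_items, token = await fetch_page(page_size=page_size, page_token=token)
        items.extend(page_items)
        if not token:  # an empty or missing token means there are no more pages
            return items
```

The default of 20 mirrors the default page size stated above; values larger than 1000 would exceed the documented maximum.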
 
 
 
