Google Cloud Dataproc Metastore Operators¶
Dataproc Metastore is a fully managed, highly available, auto-healing serverless Apache Hive metastore (HMS) that runs on Google Cloud. It supports HMS, serves as a critical component for managing the metadata of relational entities, and provides interoperability between data processing applications in the open source data ecosystem.
For more information about the service, visit the Dataproc Metastore product documentation.
Create a service¶
Before you create a Dataproc Metastore service, you need to define the service. For more information about the available fields to pass when creating a service, visit the Dataproc Metastore create service API.
A simple service configuration can look as follows:
SERVICE = {
"name": "test-service",
}
With this configuration we can create the service:
DataprocMetastoreCreateServiceOperator
create_service = DataprocMetastoreCreateServiceOperator(
task_id="create_service",
region=REGION,
project_id=PROJECT_ID,
service=SERVICE,
service_id=SERVICE_ID,
timeout=TIMEOUT,
)
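The snippets on this page reference several constants that are defined elsewhere in the example DAG. A minimal sketch of what they might look like, with assumed values you should adapt to your environment:
from airflow.providers.google.cloud.operators.dataproc_metastore import (
    DataprocMetastoreCreateServiceOperator,
)

# Hypothetical values for the placeholders used throughout these examples.
PROJECT_ID = "my-gcp-project"  # your GCP project ID
REGION = "us-central1"  # a region where Dataproc Metastore is available
SERVICE_ID = "test-service"  # matches the "name" in SERVICE above
TIMEOUT = 1200  # seconds to wait for each long-running operation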
Get a service¶
To get a service you can use:
DataprocMetastoreGetServiceOperator
get_service = DataprocMetastoreGetServiceOperator(
task_id="get_service",
region=REGION,
project_id=PROJECT_ID,
service_id=SERVICE_ID,
)
Update a service¶
You can update the service by providing a service config and an updateMask. In the updateMask argument you specify the paths, relative to Service, of the fields to update. For more information on updateMask and other parameters, take a look at the Dataproc Metastore update service API.
An example of a new service config and the updateMask:
SERVICE_TO_UPDATE = {
"labels": {
"mylocalmachine": "mylocalmachine",
"systemtest": "systemtest",
}
}
UPDATE_MASK = FieldMask(paths=["labels"])
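FieldMask here is the standard protobuf message type, assuming the usual import:
from google.protobuf.field_mask_pb2 import FieldMask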
To update a service you can use:
DataprocMetastoreUpdateServiceOperator
update_service = DataprocMetastoreUpdateServiceOperator(
task_id="update_service",
project_id=PROJECT_ID,
service_id=SERVICE_ID,
region=REGION,
service=SERVICE_TO_UPDATE,
update_mask=UPDATE_MASK,
timeout=TIMEOUT,
)
Delete a service¶
To delete a service you can use:
DataprocMetastoreDeleteServiceOperator
delete_service = DataprocMetastoreDeleteServiceOperator(
task_id="delete_service",
region=REGION,
project_id=PROJECT_ID,
service_id=SERVICE_ID,
timeout=TIMEOUT,
)
Export service metadata¶
To export metadata you can use:
DataprocMetastoreExportMetadataOperator
export_metadata = DataprocMetastoreExportMetadataOperator(
task_id="export_metadata",
destination_gcs_folder=DESTINATION_GCS_FOLDER,
project_id=PROJECT_ID,
region=REGION,
service_id=SERVICE_ID,
timeout=TIMEOUT,
)
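DESTINATION_GCS_FOLDER is not defined in the snippet; it is assumed to be a gs:// path to the folder where the metadata dump will be written, for example:
DESTINATION_GCS_FOLDER = "gs://my-bucket/metastore-export"  # hypothetical bucket and folder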
Restore a service¶
To restore a service you can use:
DataprocMetastoreRestoreServiceOperator
restore_service = DataprocMetastoreRestoreServiceOperator(
task_id="restore_metastore",
region=REGION,
project_id=PROJECT_ID,
service_id=SERVICE_ID,
backup_id=BACKUP_ID,
backup_region=REGION,
backup_project_id=PROJECT_ID,
backup_service_id=SERVICE_ID,
timeout=TIMEOUT,
)
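The backup_* arguments identify the backup to restore from. In this example they point at the same project, region, and service being restored, but they can reference a backup taken of a different service.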
Create a metadata import¶
Before you create a Dataproc Metastore metadata import, you need to define the metadata import. For more information about the available fields to pass when creating a metadata import, visit the Dataproc Metastore create metadata import API.
A simple metadata import configuration can look as follows:
METADATA_IMPORT = {
"name": "test-metadata-import",
"database_dump": {
"gcs_uri": GCS_URI,
"database_type": DB_TYPE,
},
}
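GCS_URI and DB_TYPE describe the database dump to import. Assumed example values:
GCS_URI = "gs://my-bucket/hive-dump.sql"  # hypothetical path to a database dump in GCS
DB_TYPE = "MYSQL"  # the type of database the dump was taken from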
To create a metadata import you can use:
DataprocMetastoreCreateMetadataImportOperator
import_metadata = DataprocMetastoreCreateMetadataImportOperator(
task_id="import_metadata",
project_id=PROJECT_ID,
region=REGION,
service_id=SERVICE_ID,
metadata_import=METADATA_IMPORT,
metadata_import_id=METADATA_IMPORT_ID,
timeout=TIMEOUT,
)
Create a backup¶
Before you create a Dataproc Metastore backup of the service, you need to define the backup. For more information about the available fields to pass when creating a backup, visit the Dataproc Metastore create backup API.
A simple backup configuration can look as follows:
BACKUP = {
"name": "test-backup",
}
With this configuration we can create the backup:
DataprocMetastoreCreateBackupOperator
backup_service = DataprocMetastoreCreateBackupOperator(
task_id="create_backup",
project_id=PROJECT_ID,
region=REGION,
service_id=SERVICE_ID,
backup=BACKUP,
backup_id=BACKUP_ID,
timeout=TIMEOUT,
)
Delete a backup¶
To delete a backup you can use:
DataprocMetastoreDeleteBackupOperator
delete_backup = DataprocMetastoreDeleteBackupOperator(
task_id="delete_backup",
project_id=PROJECT_ID,
region=REGION,
service_id=SERVICE_ID,
backup_id=BACKUP_ID,
timeout=TIMEOUT,
)
List backups¶
To list backups you can use:
DataprocMetastoreListBackupsOperator
list_backups = DataprocMetastoreListBackupsOperator(
task_id="list_backups",
project_id=PROJECT_ID,
region=REGION,
service_id=SERVICE_ID,
)
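Like most Airflow operators, it pushes its return value (the list of backups found) to XCom, so downstream tasks can consume it.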
Check Hive partition existence¶
To check that Hive partitions have been created in the Metastore for a given table, you can use:
MetastoreHivePartitionSensor
sensor = MetastoreHivePartitionSensor(
task_id="hive_partition_sensor",
service_id=SERVICE_ID,
region=REGION,
table=TABLE_NAME,
partitions=[PARTITION_1, PARTITION_2],
)
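The placeholders above are not defined in the snippet. Assuming a table partitioned by a single date column, they might look like this, with each partition given in Hive's column=value form:
TABLE_NAME = "transactions"  # hypothetical table name
PARTITION_1 = "ds=2023-01-01"  # hypothetical partition spec
PARTITION_2 = "ds=2023-02-01"  # multiple partition columns would be joined with "/"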