airflow.providers.databricks.hooks.databricks

Databricks hook.

This hook enables submitting and running jobs on the Databricks platform. Internally the operators talk to either the api/2.1/jobs/run-now endpoint (https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunNow) or the api/2.1/jobs/runs/submit endpoint.
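
As a quick orientation, here is a minimal, hedged sketch of both entry points (the job ID and payload below are placeholders, not values from this page):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()  # uses the "databricks_default" connection by default

    # Trigger an existing job by ID (api/2.1/jobs/run-now).
    run_id = hook.run_now({"job_id": 42})

    # Or submit a one-time run (api/2.1/jobs/runs/submit); see submit_run below.
    # run_id = hook.submit_run({"run_name": "adhoc", "tasks": [...]})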

Module Contents

Classes

RunState

Utility class for the run state concept of Databricks runs.

DatabricksHook

Interact with Databricks.

Attributes

RESTART_CLUSTER_ENDPOINT

START_CLUSTER_ENDPOINT

TERMINATE_CLUSTER_ENDPOINT

RUN_NOW_ENDPOINT

SUBMIT_RUN_ENDPOINT

GET_RUN_ENDPOINT

CANCEL_RUN_ENDPOINT

DELETE_RUN_ENDPOINT

REPAIR_RUN_ENDPOINT

OUTPUT_RUNS_JOB_ENDPOINT

CANCEL_ALL_RUNS_ENDPOINT

INSTALL_LIBS_ENDPOINT

UNINSTALL_LIBS_ENDPOINT

LIST_JOBS_ENDPOINT

WORKSPACE_GET_STATUS_ENDPOINT

RUN_LIFE_CYCLE_STATES

SPARK_VERSIONS_ENDPOINT

airflow.providers.databricks.hooks.databricks.RESTART_CLUSTER_ENDPOINT = ('POST', 'api/2.0/clusters/restart')[source]
airflow.providers.databricks.hooks.databricks.START_CLUSTER_ENDPOINT = ('POST', 'api/2.0/clusters/start')[source]
airflow.providers.databricks.hooks.databricks.TERMINATE_CLUSTER_ENDPOINT = ('POST', 'api/2.0/clusters/delete')[source]
airflow.providers.databricks.hooks.databricks.RUN_NOW_ENDPOINT = ('POST', 'api/2.1/jobs/run-now')[source]
airflow.providers.databricks.hooks.databricks.SUBMIT_RUN_ENDPOINT = ('POST', 'api/2.1/jobs/runs/submit')[source]
airflow.providers.databricks.hooks.databricks.GET_RUN_ENDPOINT = ('GET', 'api/2.1/jobs/runs/get')[source]
airflow.providers.databricks.hooks.databricks.CANCEL_RUN_ENDPOINT = ('POST', 'api/2.1/jobs/runs/cancel')[source]
airflow.providers.databricks.hooks.databricks.DELETE_RUN_ENDPOINT = ('POST', 'api/2.1/jobs/runs/delete')[source]
airflow.providers.databricks.hooks.databricks.REPAIR_RUN_ENDPOINT = ('POST', 'api/2.1/jobs/runs/repair')[source]
airflow.providers.databricks.hooks.databricks.OUTPUT_RUNS_JOB_ENDPOINT = ('GET', 'api/2.1/jobs/runs/get-output')[source]
airflow.providers.databricks.hooks.databricks.CANCEL_ALL_RUNS_ENDPOINT = ('POST', 'api/2.1/jobs/runs/cancel-all')[source]
airflow.providers.databricks.hooks.databricks.INSTALL_LIBS_ENDPOINT = ('POST', 'api/2.0/libraries/install')[source]
airflow.providers.databricks.hooks.databricks.UNINSTALL_LIBS_ENDPOINT = ('POST', 'api/2.0/libraries/uninstall')[source]
airflow.providers.databricks.hooks.databricks.LIST_JOBS_ENDPOINT = ('GET', 'api/2.1/jobs/list')[source]
airflow.providers.databricks.hooks.databricks.WORKSPACE_GET_STATUS_ENDPOINT = ('GET', 'api/2.0/workspace/get-status')[source]
airflow.providers.databricks.hooks.databricks.RUN_LIFE_CYCLE_STATES = ['PENDING', 'RUNNING', 'TERMINATING', 'TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'][source]
airflow.providers.databricks.hooks.databricks.SPARK_VERSIONS_ENDPOINT = ('GET', 'api/2.0/clusters/spark-versions')[source]
class airflow.providers.databricks.hooks.databricks.RunState(life_cycle_state, result_state='', state_message='', *args, **kwargs)[source]

Utility class for the run state concept of Databricks runs.

property is_terminal: bool[source]

True if the current state is a terminal state.

property is_successful: bool[source]

True if the result state is SUCCESS.

__eq__(other)[source]

Return self==value.

__repr__()[source]

Return repr(self).

to_json()[source]
classmethod from_json(data)[source]
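
To illustrate, a small sketch of RunState in use (the field values come from the lifecycle states listed above; to_json()/from_json() round-trip through a JSON string):

    from airflow.providers.databricks.hooks.databricks import RunState

    state = RunState(life_cycle_state="TERMINATED", result_state="SUCCESS")
    assert state.is_terminal      # TERMINATED is one of the terminal states
    assert state.is_successful    # result_state == "SUCCESS"

    # Round-trip serialization, useful when passing state between tasks.
    assert RunState.from_json(state.to_json()) == state
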
class airflow.providers.databricks.hooks.databricks.DatabricksHook(databricks_conn_id=BaseDatabricksHook.default_conn_name, timeout_seconds=180, retry_limit=3, retry_delay=1.0, retry_args=None, caller='DatabricksHook')[source]

Bases: airflow.providers.databricks.hooks.databricks_base.BaseDatabricksHook

Interact with Databricks.

Parameters
  • databricks_conn_id (str) – Reference to the Databricks connection.

  • timeout_seconds (int) – The amount of time in seconds the requests library will wait before timing-out.

  • retry_limit (int) – The number of times to retry the connection in case of service outages.

  • retry_delay (float) – The number of seconds to wait between retries (it might be a floating point number).

  • retry_args (dict[Any, Any] | None) – An optional dictionary with arguments passed to the tenacity.Retrying class (see the sketch after this list).
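
A sketch of constructing the hook with a custom retry policy; the keyword arguments are standard tenacity.Retrying options, and the values are illustrative:

    from tenacity import stop_after_attempt, wait_exponential

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook(
        databricks_conn_id="databricks_default",
        timeout_seconds=300,
        retry_limit=5,
        retry_args={
            "stop": stop_after_attempt(5),
            "wait": wait_exponential(multiplier=1, max=30),
        },
    )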

hook_name = 'Databricks'[source]
run_now(json)[source]

Utility function to call the api/2.1/jobs/run-now endpoint.

Parameters

json (dict) – The data used in the body of the request to the run-now endpoint.

Returns

the run_id as an int

Return type

int
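
For example, triggering an existing job with notebook parameters (values are placeholders; the full payload schema is described in the Jobs API docs linked above):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    run_id = hook.run_now(
        {
            "job_id": 42,
            "notebook_params": {"dry-run": "true"},
        }
    )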

submit_run(json)[source]

Utility function to call the api/2.1/jobs/runs/submit endpoint.

Parameters

json (dict) – The data used in the body of the request to the submit endpoint.

Returns

the run_id as an int

Return type

int
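
A sketch of a one-time run payload using the 2.1 tasks schema (notebook path and cluster spec are placeholders):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    run_id = hook.submit_run(
        {
            "run_name": "airflow-adhoc-run",
            "tasks": [
                {
                    "task_key": "main",
                    "notebook_task": {"notebook_path": "/Users/user@example.com/notebook"},
                    "new_cluster": {
                        "spark_version": "13.3.x-scala2.12",
                        "node_type_id": "i3.xlarge",
                        "num_workers": 2,
                    },
                }
            ],
        }
    )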

list_jobs(limit=25, offset=0, expand_tasks=False, job_name=None)[source]

Lists the jobs in the Databricks Job Service.

Parameters
  • limit (int) – The limit/batch size used to retrieve jobs.

  • offset (int) – The offset of the first job to return, relative to the most recently created job.

  • expand_tasks (bool) – Whether to include task and cluster details in the response.

  • job_name (str | None) – Optional name of a job to search.

Returns

A list of jobs.

Return type

list[dict[str, Any]]
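
For example (the job name is hypothetical; job entries in the 2.1 API carry their name under settings):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    for job in hook.list_jobs(limit=50, job_name="nightly-etl"):
        print(job["job_id"], job["settings"]["name"])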

find_job_id_by_name(job_name)[source]

Finds a job ID by its name. If there are multiple jobs with the same name, raises AirflowException.

Parameters

job_name (str) – The name of the job to look up.

Returns

The job_id as an int or None if no job was found.

Return type

int | None
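
A common pattern is resolving a job name to an ID before triggering it (sketch; the name is illustrative):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    job_id = hook.find_job_id_by_name("nightly-etl")
    if job_id is None:
        raise ValueError("job 'nightly-etl' not found")
    run_id = hook.run_now({"job_id": job_id})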

get_run_page_url(run_id)[source]

Retrieves run_page_url.

Parameters

run_id (int) – id of the run

Returns

URL of the run page

Return type

str

async a_get_run_page_url(run_id)[source]

Async version of get_run_page_url().

Parameters

run_id (int) – id of the run

Returns

URL of the run page

get_job_id(run_id)[source]

Retrieves job_id from run_id.

Parameters

run_id (int) – id of the run

Returns

Job id for given Databricks run

Return type

int

get_run_state(run_id)[source]

Retrieves run state of the run.

Please note that any Airflow tasks that call the get_run_state method will result in failure unless you have enabled xcom pickling. This can be done using the following environment variable: AIRFLOW__CORE__ENABLE_XCOM_PICKLING

If you do not want to enable xcom pickling, use the get_run_state_str method to get a string describing state, or get_run_state_lifecycle, get_run_state_result, or get_run_state_message to get individual components of the run state.

Parameters

run_id (int) – id of the run

Returns

state of the run

Return type

RunState
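
Outside of XCom, a plain polling loop over get_run_state() might look like this sketch (the poll interval and job ID are arbitrary):

    import time

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    run_id = hook.run_now({"job_id": 42})

    state = hook.get_run_state(run_id)
    while not state.is_terminal:
        time.sleep(30)  # arbitrary poll interval
        state = hook.get_run_state(run_id)

    if not state.is_successful:
        raise RuntimeError(f"run failed: {state.state_message}")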

async a_get_run_state(run_id)[source]

Async version of get_run_state().

Parameters

run_id (int) – id of the run

Returns

state of the run

get_run(run_id)[source]

Retrieve run information.

Parameters

run_id (int) – id of the run

Returns

run information

Return type

dict[str, Any]

async a_get_run(run_id)[source]

Async version of get_run.

Parameters

run_id (int) – id of the run

Returns

run information

Return type

dict[str, Any]

get_run_state_str(run_id)[source]

Return the string representation of RunState.

Parameters

run_id (int) – id of the run

Returns

string describing run state

Return type

str

get_run_state_lifecycle(run_id)[source]

Returns the lifecycle state of the run.

Parameters

run_id (int) – id of the run

Returns

string with lifecycle state

Return type

str

get_run_state_result(run_id)[source]

Returns the resulting state of the run.

Parameters

run_id (int) – id of the run

Returns

string with resulting state

Return type

str

get_run_state_message(run_id)[source]

Returns the state message for the run.

Parameters

run_id (int) – id of the run

Returns

string with state message

Return type

str

get_run_output(run_id)[source]

Retrieves run output of the run.

Parameters

run_id (int) – id of the run

Returns

output of the run

Return type

dict
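
For a notebook task, the value passed to dbutils.notebook.exit() is typically surfaced under notebook_output in this dictionary; a hedged sketch:

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    output = hook.get_run_output(run_id=42)  # hypothetical run id
    result = output.get("notebook_output", {}).get("result")
    print(result)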

cancel_run(run_id)[source]

Cancels the run.

Parameters

run_id (int) – id of the run

cancel_all_runs(job_id)[source]

Cancels all active runs of a job. The runs are canceled asynchronously.

Parameters

job_id (int) – The canonical identifier of the job to cancel all runs of

delete_run(run_id)[source]

Deletes a non-active run.

Parameters

run_id (int) – id of the run

repair_run(json)[source]

Re-run one or more tasks.

Parameters

json (dict) – The data used in the body of the request to the repair endpoint.
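
A sketch of a repair payload that re-runs only selected tasks (run ID and task keys are placeholders; see the runs/repair schema for the full field list):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    hook.repair_run(
        {
            "run_id": 42,  # hypothetical run id
            "rerun_tasks": ["transform", "load"],
        }
    )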

restart_cluster(json)[source]

Restarts the cluster.

Parameters

json (dict) – json dictionary containing cluster specification.

start_cluster(json)[source]

Starts the cluster.

Parameters

json (dict) – json dictionary containing cluster specification.

terminate_cluster(json)[source]

Terminates the cluster.

Parameters

json (dict) – json dictionary containing cluster specification.
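
All three cluster operations above accept the same minimal payload shape (the cluster ID is a placeholder):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    payload = {"cluster_id": "1234-567890-abcde123"}  # hypothetical cluster id
    hook.start_cluster(payload)
    hook.restart_cluster(payload)
    hook.terminate_cluster(payload)  # maps to api/2.0/clusters/delete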

install(json)[source]

Install libraries on the cluster.

Utility function to call the 2.0/libraries/install endpoint.

Parameters

json (dict) – json dictionary containing cluster_id and an array of library specifications

uninstall(json)[source]

Uninstall libraries on the cluster.

Utility function to call the 2.0/libraries/uninstall endpoint.

Parameters

json (dict) – json dictionary containing cluster_id and an array of library specifications
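
install() and uninstall() share the same payload shape; a sketch with a PyPI library (cluster ID and package are placeholders; jar, egg, and maven entries follow the same pattern):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    hook.install(
        {
            "cluster_id": "1234-567890-abcde123",  # hypothetical cluster id
            "libraries": [
                {"pypi": {"package": "simplejson==3.18.0"}},
            ],
        }
    )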

update_repo(repo_id, json)[source]

Updates the given Databricks repo.

Parameters
  • repo_id (str) – ID of the Databricks repo

  • json (dict[str, Any]) – payload

Returns

metadata from update

Return type

dict
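
A sketch combining get_repo_by_path() (documented below) with update_repo() to pin a repo to a branch (path and branch are placeholders):

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook()
    repo_id = hook.get_repo_by_path("/Repos/user@example.com/my-repo")
    if repo_id is not None:
        metadata = hook.update_repo(repo_id, {"branch": "main"})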

delete_repo(repo_id)[source]

Deletes the given Databricks repo.

Parameters

repo_id (str) – ID of the Databricks repo

create_repo(json)[source]

Creates a Databricks repo.

Parameters

json (dict[str, Any]) – payload

Returns

metadata of the created repo

Return type

dict

get_repo_by_path(path)[source]

Obtains a Repos ID by path.

Parameters

path (str) – path to a repository

Returns

Repos ID if it exists, None if it doesn’t.

test_connection()[source]

Test Databricks connectivity from the UI.
