airflow.providers.databricks.hooks.databricks

Databricks hook.

This hook enables submitting and running jobs on the Databricks platform. Internally, the operators talk to the api/2.1/jobs/run-now endpoint (https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunNow) or the api/2.1/jobs/runs/submit endpoint.

Module Contents

Classes

RunState

Utility class for the run state concept of Databricks runs.

DatabricksHook

Interact with Databricks.

Attributes

airflow.providers.databricks.hooks.databricks.RESTART_CLUSTER_ENDPOINT = ['POST', 'api/2.0/clusters/restart'][source]
airflow.providers.databricks.hooks.databricks.START_CLUSTER_ENDPOINT = ['POST', 'api/2.0/clusters/start'][source]
airflow.providers.databricks.hooks.databricks.TERMINATE_CLUSTER_ENDPOINT = ['POST', 'api/2.0/clusters/delete'][source]
airflow.providers.databricks.hooks.databricks.RUN_NOW_ENDPOINT = ['POST', 'api/2.1/jobs/run-now'][source]
airflow.providers.databricks.hooks.databricks.SUBMIT_RUN_ENDPOINT = ['POST', 'api/2.1/jobs/runs/submit'][source]
airflow.providers.databricks.hooks.databricks.GET_RUN_ENDPOINT = ['GET', 'api/2.1/jobs/runs/get'][source]
airflow.providers.databricks.hooks.databricks.CANCEL_RUN_ENDPOINT = ['POST', 'api/2.1/jobs/runs/cancel'][source]
airflow.providers.databricks.hooks.databricks.OUTPUT_RUNS_JOB_ENDPOINT = ['GET', 'api/2.1/jobs/runs/get-output'][source]
airflow.providers.databricks.hooks.databricks.INSTALL_LIBS_ENDPOINT = ['POST', 'api/2.0/libraries/install'][source]
airflow.providers.databricks.hooks.databricks.UNINSTALL_LIBS_ENDPOINT = ['POST', 'api/2.0/libraries/uninstall'][source]
airflow.providers.databricks.hooks.databricks.LIST_JOBS_ENDPOINT = ['GET', 'api/2.1/jobs/list'][source]
airflow.providers.databricks.hooks.databricks.WORKSPACE_GET_STATUS_ENDPOINT = ['GET', 'api/2.0/workspace/get-status'][source]
airflow.providers.databricks.hooks.databricks.RUN_LIFE_CYCLE_STATES = ['PENDING', 'RUNNING', 'TERMINATING', 'TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'][source]
class airflow.providers.databricks.hooks.databricks.RunState(life_cycle_state, result_state='', state_message='', *args, **kwargs)[source]

Utility class for the run state concept of Databricks runs.

property is_terminal(self)[source]

True if the current state is a terminal state.

property is_successful(self)[source]

True if the result state is SUCCESS.

__eq__(self, other)[source]

Return self==value.

__repr__(self)[source]

Return repr(self).

to_json(self)[source]
classmethod from_json(cls, data)[source]
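
The interface described above can be illustrated with a small stand-in. This is a sketch rather than the provider's implementation: the class name SketchRunState is hypothetical, and the set of terminal states (TERMINATED, SKIPPED, INTERNAL_ERROR) is an assumption drawn from the Databricks run lifecycle, not something this page states.

```python
import json

# Assumption: a run in one of these lifecycle states will not change any more.
TERMINAL_STATES = ("TERMINATED", "SKIPPED", "INTERNAL_ERROR")


class SketchRunState:
    """Hypothetical stand-in that mirrors the documented RunState interface."""

    def __init__(self, life_cycle_state, result_state="", state_message=""):
        self.life_cycle_state = life_cycle_state
        self.result_state = result_state
        self.state_message = state_message

    @property
    def is_terminal(self):
        # True once the lifecycle state can no longer change.
        return self.life_cycle_state in TERMINAL_STATES

    @property
    def is_successful(self):
        # True only when the result state is SUCCESS.
        return self.result_state == "SUCCESS"

    def __eq__(self, other):
        return isinstance(other, SketchRunState) and self.__dict__ == other.__dict__

    def __repr__(self):
        return str(self.__dict__)

    def to_json(self):
        # Serialize the three state fields to a JSON string.
        return json.dumps(self.__dict__)

    @classmethod
    def from_json(cls, data):
        # Rebuild a state object from the string produced by to_json().
        return cls(**json.loads(data))


state = SketchRunState("TERMINATED", "SUCCESS", "")
restored = SketchRunState.from_json(state.to_json())
```

The to_json/from_json pair gives a plain-string form of the state, which is convenient when the object itself cannot be passed around.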
class airflow.providers.databricks.hooks.databricks.DatabricksHook(databricks_conn_id=BaseDatabricksHook.default_conn_name, timeout_seconds=180, retry_limit=3, retry_delay=1.0, retry_args=None)[source]

Bases: airflow.providers.databricks.hooks.databricks_base.BaseDatabricksHook

Interact with Databricks.

Parameters
  • databricks_conn_id (str) -- Reference to the Databricks connection.

  • timeout_seconds (int) -- The amount of time in seconds the requests library will wait before timing-out.

  • retry_limit (int) -- The number of times to retry the connection in case of service outages.

  • retry_delay (float) -- The number of seconds to wait between retries (it might be a floating point number).

  • retry_args (Optional[Dict[Any, Any]]) -- An optional dictionary with arguments passed to tenacity.Retrying class.

hook_name = Databricks[source]
run_now(self, json)[source]

Utility function to call the api/2.1/jobs/run-now endpoint.

Parameters

json (dict) -- The data used in the body of the request to the run-now endpoint.

Returns

the run_id as an int

Return type

int
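
As a rough illustration of the json argument, the sketch below builds a request body for the run-now endpoint. The helper name build_run_now_payload is hypothetical; the job_id and notebook_params keys are assumed from the Databricks Jobs API and are not defined on this page.

```python
def build_run_now_payload(job_id, notebook_params=None):
    """Hypothetical helper: build a request body for the run-now endpoint."""
    payload = {"job_id": job_id}
    if notebook_params:
        # Assumption: notebook parameter values are sent as strings.
        payload["notebook_params"] = {k: str(v) for k, v in notebook_params.items()}
    return payload


payload = build_run_now_payload(42, {"date": "2024-01-01", "retries": 3})

# With a configured Databricks connection, the body would be passed on like this:
# run_id = DatabricksHook().run_now(payload)
```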

submit_run(self, json)[source]

Utility function to call the api/2.1/jobs/runs/submit endpoint.

Parameters

json (dict) -- The data used in the body of the request to the submit endpoint.

Returns

the run_id as an int

Return type

int
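
A one-time run needs a fuller body than run-now, since no job definition exists yet. The sketch below shows one possible shape, assuming the Jobs API 2.1 layout with a tasks array; the helper name build_submit_payload, the task key "main", and the example values are hypothetical.

```python
def build_submit_payload(run_name, notebook_path, cluster_id):
    """Hypothetical helper: build a single-task body for the runs/submit endpoint."""
    return {
        "run_name": run_name,
        "tasks": [
            {
                # Assumption: each task carries a key, a cluster reference and a task spec.
                "task_key": "main",
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


payload = build_submit_payload("nightly-report", "/Shared/report", "0123-456789-abcde")

# With a configured Databricks connection:
# run_id = DatabricksHook().submit_run(payload)
```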

list_jobs(self, limit=25, offset=0, expand_tasks=False)[source]

Lists the jobs in the Databricks Job Service.

Parameters
  • limit (int) -- The limit/batch size used to retrieve jobs.

  • offset (int) -- The offset of the first job to return, relative to the most recently created job.

  • expand_tasks (bool) -- Whether to include task and cluster details in the response.

Returns

A list of jobs.

Return type

List[Dict[str, Any]]
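
The limit and offset parameters suggest batched retrieval. The loop below is a sketch of that idea against a fake endpoint, not the hook's own code: fetch_all_jobs and fake_list_endpoint are hypothetical, and the jobs and has_more response keys are assumed from the Databricks Jobs API.

```python
def fetch_all_jobs(call_api, limit=25, offset=0, expand_tasks=False):
    """Hypothetical helper: collect every job by requesting one batch at a time."""
    jobs = []
    while True:
        response = call_api({"limit": limit, "offset": offset, "expand_tasks": expand_tasks})
        batch = response.get("jobs", [])
        jobs.extend(batch)
        # Stop when the service reports no further pages (or returns an empty batch).
        if not response.get("has_more", False) or not batch:
            return jobs
        offset += len(batch)


# Fake endpoint serving five jobs, used only to exercise the loop.
ALL_JOBS = [{"job_id": i, "settings": {"name": f"job-{i}"}} for i in range(5)]


def fake_list_endpoint(params):
    start = params["offset"]
    end = start + params["limit"]
    return {"jobs": ALL_JOBS[start:end], "has_more": end < len(ALL_JOBS)}


jobs = fetch_all_jobs(fake_list_endpoint, limit=2)
```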

find_job_id_by_name(self, job_name)[source]

Finds job id by its name. If there are multiple jobs with the same name, raises AirflowException.

Parameters

job_name (str) -- The name of the job to look up.

Returns

The job_id as an int or None if no job was found.

Return type

Optional[int]
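
The lookup rule (one match returns its id, no match returns None, several matches raise) can be sketched over a plain list of job dictionaries. The function find_job_id is hypothetical, ValueError stands in for AirflowException, and the settings.name layout of each job is assumed from the Databricks Jobs API.

```python
def find_job_id(jobs, job_name):
    """Hypothetical helper: return the id of the job called job_name, or None."""
    matches = [
        job["job_id"]
        for job in jobs
        if job.get("settings", {}).get("name") == job_name
    ]
    if len(matches) > 1:
        # Stand-in for the AirflowException raised by the hook.
        raise ValueError(f"There are more than one job with name {job_name}")
    return matches[0] if matches else None


sample_jobs = [
    {"job_id": 1, "settings": {"name": "etl"}},
    {"job_id": 2, "settings": {"name": "report"}},
    {"job_id": 3, "settings": {"name": "report"}},
]
etl_id = find_job_id(sample_jobs, "etl")
missing_id = find_job_id(sample_jobs, "unknown")
```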

get_run_page_url(self, run_id)[source]

Retrieves run_page_url.

Parameters

run_id (int) -- id of the run

Returns

URL of the run page

Return type

str

async a_get_run_page_url(self, run_id)[source]

Async version of get_run_page_url().

Parameters

run_id (int) -- id of the run

Returns

URL of the run page

get_job_id(self, run_id)[source]

Retrieves job_id from run_id.

Parameters

run_id (int) -- id of the run

Returns

Job id for given Databricks run

Return type

int

get_run_state(self, run_id)[source]

Retrieves run state of the run.

Please note that any Airflow task that calls the get_run_state method will fail unless XCom pickling is enabled. This can be done using the following environment variable: AIRFLOW__CORE__ENABLE_XCOM_PICKLING

If you do not want to enable XCom pickling, use the get_run_state_str method to get a string describing the state, or use get_run_state_lifecycle, get_run_state_result, or get_run_state_message to get individual components of the run state.

Parameters

run_id (int) -- id of the run

Returns

state of the run

Return type

RunState

async a_get_run_state(self, run_id)[source]

Async version of get_run_state().

Parameters

run_id (int) -- id of the run

Returns

state of the run

get_run_state_str(self, run_id)[source]

Return the string representation of RunState.

Parameters

run_id (int) -- id of the run

Returns

string describing run state

Return type

str

get_run_state_lifecycle(self, run_id)[source]

Returns the lifecycle state of the run.

Parameters

run_id (int) -- id of the run

Returns

string with lifecycle state

Return type

str

get_run_state_result(self, run_id)[source]

Returns the result state of the run.

Parameters

run_id (int) -- id of the run

Returns

string with resulting state

Return type

str

get_run_state_message(self, run_id)[source]

Returns the state message for the run.

Parameters

run_id (int) -- id of the run

Returns

string with state message

Return type

str
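
The string-returning methods above are enough to wait for a run without handling a RunState object. The polling loop below is a sketch under stated assumptions: FakeHook is a hypothetical stand-in that replays canned states, wait_for_run is not part of the hook, and the set of terminal lifecycle states is assumed rather than taken from this page.

```python
import time

# Assumption: a run in one of these lifecycle states has finished.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


class FakeHook:
    """Hypothetical stand-in that replays canned lifecycle states."""

    def __init__(self, lifecycle_states, result_state):
        self._states = iter(lifecycle_states)
        self._result = result_state

    def get_run_state_lifecycle(self, run_id):
        return next(self._states)

    def get_run_state_result(self, run_id):
        return self._result


def wait_for_run(hook, run_id, poll_seconds=30.0, sleep=time.sleep):
    """Poll the lifecycle state until it is terminal, then return the result state."""
    while True:
        lifecycle = hook.get_run_state_lifecycle(run_id)
        if lifecycle in TERMINAL_STATES:
            return hook.get_run_state_result(run_id)
        sleep(poll_seconds)


hook = FakeHook(["PENDING", "RUNNING", "TERMINATED"], "SUCCESS")
# The sleep function is swapped for a no-op so the example finishes immediately.
result = wait_for_run(hook, run_id=1, sleep=lambda seconds: None)
```

Passing the sleep function as a parameter keeps the loop testable; a real DatabricksHook instance could be passed in place of FakeHook, since only the two documented methods are used.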

get_run_output(self, run_id)[source]

Retrieves run output of the run.

Parameters

run_id (int) -- id of the run

Returns

output of the run

Return type

dict

cancel_run(self, run_id)[source]

Cancels the run.

Parameters

run_id (int) -- id of the run

restart_cluster(self, json)[source]

Restarts the cluster.

Parameters

json (dict) -- json dictionary containing cluster specification.

start_cluster(self, json)[source]

Starts the cluster.

Parameters

json (dict) -- json dictionary containing cluster specification.

terminate_cluster(self, json)[source]

Terminates the cluster.

Parameters

json (dict) -- json dictionary containing cluster specification.
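
For the three cluster methods, the request body is typically small. The sketch below assumes the Databricks Clusters API identifies the target by a cluster_id key; the helper cluster_request and the example id are hypothetical.

```python
def cluster_request(cluster_id):
    """Hypothetical helper: build the body shared by restart, start and terminate calls."""
    if not cluster_id:
        raise ValueError("cluster_id must be a non-empty string")
    # Assumption: the endpoints identify the cluster by this single key.
    return {"cluster_id": cluster_id}


body = cluster_request("0123-456789-abcde")

# With a configured Databricks connection:
# hook = DatabricksHook()
# hook.start_cluster(body)
# hook.restart_cluster(body)
# hook.terminate_cluster(body)
```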

install(self, json)[source]

Install libraries on the cluster.

Utility function to call the api/2.0/libraries/install endpoint.

Parameters

json (dict) -- json dictionary containing cluster_id and an array of libraries

uninstall(self, json)[source]

Uninstall libraries on the cluster.

Utility function to call the api/2.0/libraries/uninstall endpoint.

Parameters

json (dict) -- json dictionary containing cluster_id and an array of libraries
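
The body for install and uninstall combines a cluster_id with an array of library specifications. The sketch below covers PyPI packages only and assumes the pypi/package layout of the Databricks Libraries API; the helper name build_library_payload and the example values are hypothetical.

```python
def build_library_payload(cluster_id, pypi_packages):
    """Hypothetical helper: build a body listing PyPI libraries for one cluster."""
    return {
        "cluster_id": cluster_id,
        # Assumption: each PyPI library is described as {"pypi": {"package": ...}}.
        "libraries": [{"pypi": {"package": name}} for name in pypi_packages],
    }


payload = build_library_payload("0123-456789-abcde", ["requests", "simplejson==3.8.0"])

# With a configured Databricks connection:
# DatabricksHook().install(payload)
# DatabricksHook().uninstall(payload)
```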

update_repo(self, repo_id, json)[source]

Updates the given Databricks repo.

Parameters
  • repo_id (str) -- ID of the Databricks repo

  • json (Dict[str, Any]) -- payload

Returns

metadata from update

Return type

dict
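
The payload for update_repo is not described further here. A plausible shape, assuming the Databricks Repos API accepts either a branch or a tag to check out, is sketched below; the helper build_repo_update_payload and the repo id in the usage comment are hypothetical.

```python
def build_repo_update_payload(branch=None, tag=None):
    """Hypothetical helper: build an update body naming exactly one of branch or tag."""
    if (branch is None) == (tag is None):
        raise ValueError("Specify exactly one of branch or tag")
    return {"branch": branch} if branch is not None else {"tag": tag}


payload = build_repo_update_payload(branch="main")

# With a configured Databricks connection:
# metadata = DatabricksHook().update_repo("123", payload)
```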

delete_repo(self, repo_id)[source]

Deletes the given Databricks repo.

Parameters

repo_id (str) -- ID of the Databricks repo

Returns

create_repo(self, json)[source]

Creates a Databricks repo.

Parameters

json (Dict[str, Any]) -- payload

Returns

Return type

dict
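
As with update_repo, the create payload is left open on this page. The sketch below assumes the Databricks Repos API fields url, provider and an optional workspace path; the helper build_repo_create_payload and all example values are hypothetical.

```python
def build_repo_create_payload(url, provider, path=None):
    """Hypothetical helper: build a body for creating a repo from a Git URL."""
    payload = {"url": url, "provider": provider}
    if path is not None:
        # Assumption: path is the workspace location where the repo is placed.
        payload["path"] = path
    return payload


payload = build_repo_create_payload(
    "https://github.com/example/project.git", "gitHub", "/Repos/team/project"
)

# With a configured Databricks connection:
# metadata = DatabricksHook().create_repo(payload)
```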

get_repo_by_path(self, path)[source]

Obtains the Repos ID by path.

Parameters

path (str) -- path to a repository

Returns

Repos ID if it exists, None if it doesn't.
