Databricks
This documentation describes the integration of MindsDB with Databricks, the world’s first data intelligence platform powered by generative AI. The integration allows MindsDB to access data stored in a Databricks workspace and enhance it with AI capabilities.
Prerequisites
Before proceeding, ensure the following prerequisites are met:
- Install MindsDB locally via Docker or Docker Desktop.
- To connect Databricks to MindsDB, install the required dependencies following this instruction.
If the Databricks cluster you are attempting to connect to is terminated, executing the queries given below will attempt to start the cluster and therefore, the first query may take a few minutes to execute.
To avoid any delays, ensure that the Databricks cluster is running before executing the queries.
Connection
Establish a connection to your Databricks workspace from MindsDB by executing the following SQL command:
CREATE DATABASE databricks_datasource
WITH
engine = 'databricks',
parameters = {
"server_hostname": "adb-1234567890123456.7.azuredatabricks.net",
"http_path": "sql/protocolv1/o/1234567890123456/1234-567890-test123",
"access_token": "dapi1234567890ab1cde2f3ab456c7d89efa",
"schema": "example_db"
};
Required connection parameters include the following:
server_hostname
: The server hostname for the cluster or SQL warehouse.http_path
: The HTTP path of the cluster or SQL warehouse.access_token
: A Databricks personal access token for the workspace.
Refer the instructions given https://docs.databricks.com/en/integrations/compute-details.html and https://docs.databricks.com/en/dev-tools/python-sql-connector.html#authentication to find the connection parameters mentioned above for your compute resource.
Optional connection parameters include the following:
session_configuration
: Additional (key, value) pairs to set as Spark session configuration parameters. This should be provided as a JSON string.http_headers
: Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes. This should be provided as a JSON string.catalog
: The catalog to use for the connection. Default ishive_metastore
.schema
: The schema (database) to use for the connection. Default isdefault
.
Usage
Retrieve data from a specified table by providing the integration name, catalog, schema, and table name:
SELECT *
FROM databricks_datasource.catalog_name.schema_name.table_name
LIMIT 10;
The catalog and schema names only need to be provided if the table to be queried is not in the specified (or default) catalog and schema.
Run Databricks SQL queries directly on the connected Databricks workspace:
SELECT * FROM databricks_datasource (
--Native Query Goes Here
SELECT
city,
car_model,
RANK() OVER (PARTITION BY car_model ORDER BY quantity) AS rank
FROM dealer
QUALIFY rank = 1;
);
The above examples utilize databricks_datasource
as the datasource name, which is defined in the CREATE DATABASE
command.
Troubleshooting Guide
Database Connection Error
- Symptoms: Failure to connect MindsDB with the Databricks workspace.
- Checklist:
- Make sure the Databricks workspace is active.
- Confirm that server hostname, HTTP path, access token are correctly provided. If the catalog and schema are provided, ensure they are correct as well.
- Ensure a stable network between MindsDB and Databricks workspace.
SQL statements running against tables (of reasonable size) are taking longer than expected.
- Symptoms: SQL queries taking longer than expected to execute.
- Checklist:
- Ensure the Databricks cluster is running before executing the queries.
- Check the network connection between MindsDB and Databricks workspace.
SQL statement cannot be parsed by mindsdb_sql
- Symptoms: SQL queries failing or not recognizing table names containing special characters.
- Checklist:
- Ensure table names with special characters are enclosed in backticks.
- Examples:
- Incorrect: SELECT * FROM integration.travel-data
- Incorrect: SELECT * FROM integration.‘travel-data’
- Correct: SELECT * FROM integration.`travel-data`