This documentation describes the integration of MindsDB with Databricks, the world’s first data intelligence platform powered by generative AI. The integration allows MindsDB to access data stored in a Databricks workspace and enhance it with AI capabilities.
This data source integration is thread-safe, utilizing a connection pool where each thread is assigned its own connection. When handling requests in parallel, threads retrieve connections from the pool as needed.
Before proceeding, ensure the following prerequisites are met:
If the Databricks cluster you are connecting to is terminated, executing the queries below will attempt to start it, so the first query may take a few minutes to complete. To avoid delays, ensure that the Databricks cluster is running before executing the queries.
Establish a connection to your Databricks workspace from MindsDB by executing the following SQL command:
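A minimal sketch of the connection command, following MindsDB's `CREATE DATABASE` syntax; the hostname, HTTP path, and token below are placeholders and must be replaced with the values for your own workspace:

```sql
CREATE DATABASE databricks_datasource
WITH
    ENGINE = 'databricks',
    PARAMETERS = {
        "server_hostname": "adb-1234567890123456.7.azuredatabricks.net",
        "http_path": "sql/protocolv1/o/1234567890123456/1234-567890-abcdefgh",
        "access_token": "dapi..."
    };
```

Any of the optional parameters described below (such as `catalog` or `schema`) can be added as additional keys in the same `PARAMETERS` object.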
Required connection parameters include the following:

- `server_hostname`: The server hostname for the cluster or SQL warehouse.
- `http_path`: The HTTP path of the cluster or SQL warehouse.
- `access_token`: A Databricks personal access token for the workspace.

Refer to the instructions at https://docs.databricks.com/en/integrations/compute-details.html and https://docs.databricks.com/en/dev-tools/python-sql-connector.html#authentication to find the connection parameters mentioned above for your compute resource.
Optional connection parameters include the following:

- `session_configuration`: Additional (key, value) pairs to set as Spark session configuration parameters, provided as a JSON string.
- `http_headers`: Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes, provided as a JSON string.
- `catalog`: The catalog to use for the connection. Defaults to `hive_metastore`.
- `schema`: The schema (database) to use for the connection. Defaults to `default`.

Retrieve data from a specified table by providing the integration name, catalog, schema, and table name:
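A sketch of such a query; the catalog, schema, and table names below are hypothetical:

```sql
SELECT *
FROM databricks_datasource.example_catalog.example_schema.example_table
LIMIT 10;
```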
The catalog and schema names only need to be provided if the table to be queried is not in the specified (or default) catalog and schema.
Run Databricks SQL queries directly on the connected Databricks workspace:
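For example, using MindsDB's native query syntax, the inner query is passed through and executed directly on Databricks (the table and column names are hypothetical):

```sql
SELECT * FROM databricks_datasource (

    SELECT location, AVG(price) AS avg_price
    FROM example_table
    GROUP BY location

);
```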
The above examples use `databricks_datasource` as the data source name, which is defined in the `CREATE DATABASE` command.
Common issues and likely causes:

- Database Connection Error: Verify that the `server_hostname`, `http_path`, and `access_token` provided are correct, and that the cluster or SQL warehouse is running and reachable from MindsDB.
- SQL statements running against tables (of reasonable size) are taking longer than expected: The cluster may have been terminated; the first query issued against a terminated cluster attempts to restart it, which can take a few minutes.
- SQL statement cannot be parsed by mindsdb_sql: Ensure that table and column names containing spaces or special characters are enclosed in backticks.