A knowledge base is an advanced system that organizes information based on semantic meaning rather than simple keyword matching. It integrates embedding models, reranking models, and vector stores to enable context-aware data retrieval.

CREATE KNOWLEDGE_BASE Syntax

Here is the syntax for creating a knowledge base:

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-..."
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-..."
    },
    storage = my_vector_store.storage_table,
    metadata_columns = ['date', 'creator', ...],
    content_columns = ['review', 'content', ...],
    id_column = 'id';

Upon execution, it registers my_kb and associates the specified models and storage. my_kb is a unique identifier of the knowledge base within MindsDB.

Users can use the variables and the from_env() function to pass parameters when creating knowledge bases.

As MindsDB stores objects, such as models or knowledge bases, inside projects, you can create a knowledge base inside a custom project.

CREATE PROJECT my_project;

CREATE KNOWLEDGE_BASE my_project.my_kb
USING
    ...

Supported LLMs

Below is the list of all language models supported for the embedding_model and reranking_model parameters.

provider = 'openai'

When choosing openai as the model provider, users should define the following model parameters.

  • model_name stores the name of the OpenAI model to be used.
  • api_key stores the OpenAI API key.

Learn more about the OpenAI integration with MindsDB here.

provider = 'openai_azure'

When choosing openai_azure as the model provider, users should define the following model parameters.

  • model_name stores the name of the OpenAI model to be used.
  • api_key stores the OpenAI API key.
  • base_url stores the base URL of the Azure instance.
  • api_version stores the version of the Azure instance.

provider = 'bedrock'

When choosing bedrock as the model provider, users should define the following model parameters.

  • model_name stores the name of the model available via Amazon Bedrock.
  • aws_access_key_id stores a unique identifier associated with your AWS account, used to identify the user or application making requests to AWS.
  • aws_region_name stores the name of the AWS region you want to send your requests to (e.g., "us-west-2").
  • aws_secret_access_key stores the secret key associated with your AWS access key ID. It is used to sign your requests securely.
  • aws_session_token stores a temporary token used for short-term security credentials when using AWS Identity and Access Management (IAM) roles or temporary credentials.

provider = 'snowflake'

When choosing snowflake as the model provider, users should choose one of the available models from Snowflake Cortex AI and define the following model parameters.

  • model_name stores the name of the model available via Snowflake Cortex AI.
  • api_key stores the Snowflake Cortex AI API key.
  • snowflake_account_id stores the Snowflake account ID.

embedding_model

The embedding model is a required component of the knowledge base. It stores specifications of the embedding model to be used.

Users can define the embedding model choosing one of the following options.

Option 1. Use the embedding_model parameter to define the specification.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    embedding_model = {

        "provider": "azure_openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01"

    },
    ...

Option 2. Define the default embedding model in the MindsDB configuration file.

You can define the default models in the Settings of the MindsDB Editor GUI.

Note that if you define default_embedding_model in the configuration file, you do not need to provide the embedding_model parameter when creating a knowledge base. If provide both, then the values from the embedding_model parameter are used.

"default_embedding_model": {

   "provider": "azure_openai",
   "model_name" : "text-embedding-3-large",
   "api_key": "sk-abc123",
   "base_url": "https://ai-6689.openai.azure.com/",
   "api_version": "2024-02-01"

}

The embedding model specification includes:

  • provider It is a required parameter. It defines the model provider.

  • model_name It is a required parameter. It defines the embedding model name as specified by the provider.

  • api_key The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this api_key parameter, or in the OPENAI_API_KEY environment variable for "provider": "openai" and AZURE_OPENAI_API_KEY environment variable for "provider": "azure_openai".

  • base_url It is an optional parameter, which defaults to https://api.openai.com/v1/. It is a required parameter when using the azure_openai provider. It is the root URL used to send API requests.

  • api_version It is an optional parameter. It is a required parameter when using the azure_openai provider. It defines the API version.

reranking_model

The reranking model is an optional component of the knowledge base. It stores specifications of the reranking model to be used.

Users can disable reranking features of knowledge bases by setting this parameter to false.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    reranking_model = false,
    ...

Users can enable reranking features of knowledge bases by defining the reranking model choosing one of the following options.

Option 1. Use the reranking_model parameter to define the specification.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    reranking_model = {

        "provider": "azure_openai",
        "model_name" : "gpt-4o",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01",
        "method": "multi-class"

    },
    ...

Option 2. Define the default reranking model in the MindsDB configuration file.

You can define the default models in the Settings of the MindsDB Editor GUI.

Note that if you define default_reranking_model in the configuration file, you do not need to provide the reranking_model parameter when creating a knowledge base. If provide both, then the values from the reranking_model parameter are used.

"default_reranking_model": {

  "provider": "azure_openai",
  "model_name" : "gpt-4o",
  "api_key": "sk-abc123",
  "base_url": "https://ai-6689.openai.azure.com/",
  "api_version": "2024-02-01",
  "method": "multi-class"

}

The reranking model specification includes:

  • provider It is a required parameter. It defines the model provider as listed in supported LLMs.

  • model_name It is a required parameter. It defines the embedding model name as specified by the provider.

  • api_key The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this api_key parameter, or in the OPENAI_API_KEY environment variable for "provider": "openai" and AZURE_OPENAI_API_KEY environment variable for "provider": "azure_openai".

  • base_url It is an optional parameter, which defaults to https://api.openai.com/v1/. It is a required parameter when using the azure_openai provider. It is the root URL used to send API requests.

  • api_version It is an optional parameter. It is a required parameter when using the azure_openai provider. It defines the API version.

  • method It is an optional parameter. It defines the method used to calculate the relevance of the output rows. The available options include multi-class and binary. It defaults to multi-class.

Reranking Method

The multi-class reranking method classifies each document chunk (that meets any specified metadata filtering conditions) into one of four relevance classes:

  1. Not relevant with class weight of 0.25.
  2. Slightly relevant with class weight of 0.5.
  3. Moderately relevant with class weight of 0.75.
  4. Highly relevant with class weight of 1.

The overall relevance_score of a document is calculated as the sum of each chunk’s class weight multiplied by its class probability (from model logprob output).

The binary reranking method simplifies classification by determining whether a document is relevant or not, without intermediate relevance levels. With this method, the overall relevance_score of a document is calculated based on the model log probability.

storage

The vector store is a required component of the knowledge base. It stores data in the form of embeddings.

It is optional for users to provide the storage parameter. If not provided, the default ChromaDB is created when creating a knowledge base.

The available options include either PGVector or ChromaDB.

It is recommended to use PGVector version 0.8.0 or higher for a better performance.

If the storage parameter is not provided, the system creates the default ChromaDB vector database called <kb_name>_chromadb with the default table called default_collection that stores the embedded data. This default ChromaDB vector database is stored in MindsDB’s storage.

In order to provide the storage vector database, it is required to connect it to MindsDB beforehand.

Here is an example for PGVector.

CREATE DATABASE my_pgvector
WITH ENGINE = 'pgvector',
PARAMETERS = {
    "host": "127.0.0.1",
    "port": 5432,
    "database": "postgres",
    "user": "user",
    "password": "password",
    "distance": "cosine"
};

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    storage = my_pgvector.storage_table,
    ...

Note that you do not need to have the storage_table created as it is created when creating a knowledge base.

metadata_columns

The data inserted into the knowledge base can be classified as metadata, which enables users to filter the search results using defined data fields.

Note that source data column(s) included in metadata_columns cannot be used in content_columns, and vice versa.

This parameter is an array of strings that lists column names from the source data to be used as metadata. If not provided, then all inserted columns (except for columns defined as id_column and content_columns) are considered metadata columns.

Here is an example of usage. A user wants to store the following data in a knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

The product column can be used as metadata to enable metadata filtering.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    metadata_columns = ['product'],
    ...

content_columns

The data inserted into the knowledge base can be classified as content, which is embedded by the embedding model and stored in the underlying vector store.

Note that source data column(s) included in content_columns cannot be used in metadata_columns, and vice versa.

This parameter is an array of strings that lists column names from the source data to be used as content and processed into embeddings. If not provided, the content column is expected by default when inserting data into the knowledge base.

Here is an example of usage. A user wants to store the following data in a knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

The notes column can be used as content.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    content_columns = ['notes'],
    ...

id_column

The ID column uniquely identifies each source data row in the knowledge base.

It is an optional parameter. If provided, this parameter is a string that contains the source data ID column name. If not provided, it is generated from the hash of the content columns.

Here is an example of usage. A user wants to store the following data in a knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

The order_id column can be used as ID.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    id_column = 'order_id',
    ...

Note that if the source data row is chunked into multiple chunks by the knowledge base (that is, to optimize the storage), then these rows in the knowledge base have the same ID value that identifies chunks from one source data row.

Available options for the ID column values

  • User-Defined ID Column:
    When users defined the id_column parameter, the values from the provided source data column are used to identify source data rows within the knowledge base.

  • User-Generated ID Column:
    When users do not have a column that uniquely identifies each row in their source data, they can generate the ID column values when inserting data into the knowledge base using functions like HASH() or ROW_NUMBER().

INSERT INTO my_kb (
    SELECT ROW_NUMBER() OVER (ORDER BY order_id) AS id, *
    FROM sample_data.orders
);
  • Default ID Column:
    If the id_column parameter is not defined, its default values are build from the hash of the content columns and follow the format: <first 16 char of md5 hash of row content>.

Example

Here is a sample knowledge base that will be used for examples in the following content.

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-abc123"
    },
    metadata_columns = ['product'],
    content_columns = ['notes'],
    id_column = 'order_id';

DESCRIBE KNOWLEDGE_BASE Syntax

Users can get details about the knowledge base using the DESCRIBE KNOWLEDGE_BASE command.

DESCRIBE KNOWLEDGE_BASE my_kb;

Here is the sample output:

+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
| NAME    | PROJECT | MODEL  | STORAGE                                | PARAMS                                                                                                                                                                                                                                       | INSERT_STARTED_AT | INSERT_FINISHED_AT | PROCESSED_ROWS | ERROR | QUERY_ID |
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
| my_kb   | mindsdb | [NULL] | my_kb_chromadb.default_collection      | {"embedding_model": {"provider": "openai", "model_name": "text-embedding-ada-002", "api_key": "sk-xxx"}, "reranking_model": {"provider": "openai", "model_name": "gpt-4o", "api_key": "sk-xxx"}, "default_vector_storage": "my_kb_chromadb"} | [NULL]            | [NULL]             | [NULL]         | [NULL]| [NULL]   |
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+

DROP KNOWLEDGE_BASE Syntax

Here is the syntax for deleting a knowledge base:

DROP KNOWLEDGE_BASE my_kb;

Upon execution, it removes the knowledge base with its content.

Upon execution, it identifies matching records based on the user-defined condition and removes all associated data (metadata, content, chunks, embeddings) for matching records from the KB’s storage.