A knowledge base is an advanced system that organizes information based on semantic meaning rather than simple keyword matching. It integrates embedding models, reranking models, and vector stores to enable context-aware data retrieval.

By performing semantic reasoning across multiple data points, a knowledge base delivers deeper insights and more accurate responses, making it a powerful tool for intelligent data access.

How Knowledge Bases Work

Before diving into the syntax, here is a quick walkthrough showing how knowledge bases work in MindsDB.

We start by creating a knowledge base and inserting data. Next we can run semantic search queries with metadata filtering.

1

Create a knowledge base

Use the CREATE KNOWLEDGE_BASE command to create a knowledge base, specifying all its components.

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-abc123"
    },
    metadata_columns = ['product'],
    content_columns = ['notes'],
    id_column = 'order_id';
2

Insert data into the knowledge base

In this example, we use a simple dataset containing customer notes for product orders which will be inserted into the knowledge base.

+----------+-----------------------+------------------------+
| order_id | product               | notes                  |
+----------+-----------------------+------------------------+
| A1B      | Wireless Mouse        | Request color: black   |
| 3XZ      | Bluetooth Speaker     | Gift wrap requested    |
| Q7P      | Aluminum Laptop Stand | Prefer aluminum finish |
+----------+-----------------------+------------------------+

Use the INSERT INTO command to ingest data into the knowledge base.

INSERT INTO my_kb
    SELECT order_id, product, notes
    FROM sample_data.orders;
3

Run semantic search on the knowledge base

Query the knowledge base using semantic search.

SELECT *
FROM my_kb
WHERE content = 'color'

This query returns:

+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| id  | chunk_id             | chunk_content           | metadata                                                                                                                                                                                            | distance           | relevance          |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black    | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0}        | 0.5743341242061104 | 0.5093188026135379 |
| Q7P | Q7P_notes:1of1:0to22 | Prefer aluminum finish  | {"chunk_index":0,"content_column":"notes","end_char":22,"original_doc_id":"Q7P_notes","original_row_id":"Q7P","product":"Aluminum Laptop Stand","source":"TextChunkingPreprocessor","start_char":0} | 0.7744703514692067 | 0.2502580835880018 |
| 3XZ | 3XZ_notes:1of1:0to19 | Gift wrap requested     | {"chunk_index":0,"content_column":"notes","end_char":19,"original_doc_id":"3XZ_notes","original_row_id":"3XZ","product":"Bluetooth Speaker","source":"TextChunkingPreprocessor","start_char":0}     | 0.8010851611432231 | 0.2500003885558766 |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
4

Get the most relevant search results

Query the knowledge base using semantic search and define the relevance_threshold parameter to receive only the best matching data for your use case.

SELECT *
FROM my_kb
WHERE content = 'color'
AND relevance >= 0.2502;

This query returns:

+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| id  | chunk_id             | chunk_content           | metadata                                                                                                                                                                                            | distance           | relevance          |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black    | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0}        | 0.5743341242061104 | 0.5093188026135379 |
| Q7P | Q7P_notes:1of1:0to22 | Prefer aluminum finish  | {"chunk_index":0,"content_column":"notes","end_char":22,"original_doc_id":"Q7P_notes","original_row_id":"Q7P","product":"Aluminum Laptop Stand","source":"TextChunkingPreprocessor","start_char":0} | 0.7744703514692067 | 0.2502580835880018 |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
5

Filter results by metadata

Add metadata filtering to focus your search.

SELECT *
FROM my_kb
WHERE product = 'Wireless Mouse'
AND content = 'color'
AND relevance >= 0.2502;

This query returns:

+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------+
| id  | chunk_id             | chunk_content          | metadata                                                                                                                                                                                     | distance           | relevance         |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black   | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0} | 0.5743341242061104 | 0.504396172197583 |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------+

The following sections explain the syntax and other features of knowledge bases.

CREATE KNOWLEDGE_BASE Syntax

Here is the syntax for creating a knowledge base:

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-..."
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-..."
    },
    storage = my_vector_store.storage_table,
    metadata_columns = ['date', 'creator', ...],
    content_columns = ['review', 'content', ...],
    id_column = 'id';

Upon execution, it registers my_kb and associates the specified models and storage. my_kb is a unique identifier of the knowledge base within MindsDB.

Users can use the variables and the from_env() function to pass parameters when creating knowledge bases.

As MindsDB stores objects, such as models or knowledge bases, inside projects, you can create a knowledge base inside a custom project.

CREATE PROJECT my_project;

CREATE KNOWLEDGE_BASE my_project.my_kb
USING
    ...

Supported LLMs

Below is the list of all language models supported for the embedding_model and reranking_model parameters.

provider = 'openai'

When choosing openai as the model provider, users should define the following model parameters.

  • model_name stores the name of the OpenAI model to be used.
  • api_key stores the OpenAI API key.

Learn more about the OpenAI integration with MindsDB here.

provider = 'openai_azure'

When choosing openai_azure as the model provider, users should define the following model parameters.

  • model_name stores the name of the OpenAI model to be used.
  • api_key stores the OpenAI API key.
  • base_url stores the base URL of the Azure instance.
  • api_version stores the version of the Azure instance.

provider = 'bedrock'

When choosing bedrock as the model provider, users should define the following model parameters.

  • model_name stores the name of the model available via Amazon Bedrock.
  • aws_access_key_id stores a unique identifier associated with your AWS account, used to identify the user or application making requests to AWS.
  • aws_region_name stores the name of the AWS region you want to send your requests to (e.g., "us-west-2").
  • aws_secret_access_key stores the secret key associated with your AWS access key ID. It is used to sign your requests securely.
  • aws_session_token stores a temporary token used for short-term security credentials when using AWS Identity and Access Management (IAM) roles or temporary credentials.

provider = 'snowflake'

When choosing snowflake as the model provider, users should define the following model parameters.

  • model_name stores the name of the model available via Snowflake Cortex AI.
  • api_key stores the Snowflake API key.
  • snowflake_account_id stores the Snowflake account ID.

provider = 'ollama'

When choosing ollama as the model provider, users should define the following model parameters.

  • model_name stores the name of the model available via Ollama.
  • base_url stores the base URL for accessing you Ollama instance.

Example:

CREATE KNOWLEDGE_BASE my_kb
        USING
        ...
        embedding_model = {
            "provider": "ollama",
            "model_name": "mxbai-embed-large",
            "base_url": "http://localhost:11434"
        },
        reranking_model = {
            "provider": "ollama",
            "model_name": "deepseek-r1:1.5b",
            "base_url": "http://localhost:11434"
        };

Learn more about the Ollama integration with MindsDB here.

embedding_model

The embedding model is a required component of the knowledge base. It stores specifications of the embedding model to be used.

Users can define the embedding model choosing one of the following options.

Option 1. Use the embedding_model parameter to define the specification.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    embedding_model = {

        "provider": "azure_openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01"

    },
    ...

Option 2. Define the default embedding model in the MindsDB configuration file.

Note that if you define default_embedding_model in the configuration file, you do not need to provide the embedding_model parameter when creating a knowledge base. If provide both, then the values from the embedding_model parameter are used.

"default_embedding_model": {

   "provider": "azure_openai",
   "model_name" : "text-embedding-3-large",
   "api_key": "sk-abc123",
   "base_url": "https://ai-6689.openai.azure.com/",
   "api_version": "2024-02-01"

}

The embedding model specification includes:

  • provider It is a required parameter. It defines the model provider. Currently, the supported providers include OpenAI (openai) and OpenAI via Azure (azure_openai).

  • model_name It is a required parameter. It defines the embedding model name as specified by the provider. Users can choose one of the OpenAI embedding models.

  • api_key The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this api_key parameter, or in the OPENAI_API_KEY environment variable for "provider": "openai" and AZURE_OPENAI_API_KEY environment variable for "provider": "azure_openai".

  • base_url It is an optional parameter, which defaults to https://api.openai.com/v1/. It is a required parameter when using the azure_openai provider. It is the root URL used to send API requests.

  • api_version It is an optional parameter. It is a required parameter when using the azure_openai provider. It defines the API version.

reranking_model

The reranking model is an optional component of the knowledge base. It stores specifications of the reranking model to be used.

Users can define the reranking model choosing one of the following options.

Option 1. Use the reranking_model parameter to define the specification.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    reranking_model = {

        "provider": "azure_openai",
        "model_name" : "gpt-4o",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01",
        "method": "multi-class"

    },
    ...

Option 2. Define the default reranking model in the MindsDB configuration file.

Note that if you define default_llm in the configuration file, you do not need to provide the reranking_model parameter when creating a knowledge base. If provide both, then the values from the reranking_model parameter are used.

"default_llm": {

  "provider": "azure_openai",
  "model_name" : "gpt-4o",
  "api_key": "sk-abc123",
  "base_url": "https://ai-6689.openai.azure.com/",
  "api_version": "2024-02-01",
  "method": "multi-class"

}

The reranking model specification includes:

  • provider It is a required parameter. It defines the model provider. Currently, the supported providers include OpenAI (openai) and OpenAI via Azure (azure_openai).

  • model_name It is a required parameter. It defines the embedding model name as specified by the provider. Users can choose one of the OpenAI chat models.

  • api_key The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this api_key parameter, or in the OPENAI_API_KEY environment variable for "provider": "openai" and AZURE_OPENAI_API_KEY environment variable for "provider": "azure_openai".

  • base_url It is an optional parameter, which defaults to https://api.openai.com/v1/. It is a required parameter when using the azure_openai provider. It is the root URL used to send API requests.

  • api_version It is an optional parameter. It is a required parameter when using the azure_openai provider. It defines the API version.

  • method It is an optional parameter. It defines the method used to calculate the relevance of the output rows. The available options include multi-class and binary. It defaults to multi-class.

Reranking Method

The multi-class reranking method classifies each document chunk (that meets any specified metadata filtering conditions) into one of four relevance classes:

  1. Not relevant with class weight of 0.25.
  2. Slightly relevant with class weight of 0.5.
  3. Moderately relevant with class weight of 0.75.
  4. Highly relevant with class weight of 1.

The overall relevance_score of a document is calculated as the sum of each chunk’s class weight multiplied by its class probability (from model logprob output).

The binary reranking method simplifies classification by determining whether a document is relevant or not, without intermediate relevance levels. With this method, the overall relevance_score of a document is calculated based on the model log probability.

storage

The vector store is a required component of the knowledge base. It stores data in the form of embeddings.

It is optional for users to provide the storage parameter. If not provided, the default ChromaDB is created when creating a knowledge base.

The available options include either PGVector or ChromaDB.

It is recommended to use PGVector version 0.8.0 or higher for a better performance.

If the storage parameter is not provided, the system creates the default ChromaDB vector database called <kb_name>_chromadb with the default table called default_collection that stores the embedded data. This default ChromaDB vector database is stored in MindsDB’s storage.

In order to provide the storage vector database, it is required to connect it to MindsDB beforehand.

Here is an example for PGVector.

CREATE DATABASE my_pgvector
WITH ENGINE = 'pgvector',
PARAMETERS = {
    "host": "127.0.0.1",
    "port": 5432,
    "database": "postgres",
    "user": "user",
    "password": "password",
    "distance": "cosine"
};

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    storage = my_pgvector.storage_table,
    ...

Note that you do not need to have the storage_table created as it is created when creating a knowledge base.

metadata_columns

The data inserted into the knowledge base can be classified as metadata, which enables users to filter the search results using defined data fields.

Note that source data column(s) included in metadata_columns cannot be used in content_columns, and vice versa.

This parameter is an array of strings that lists column names from the source data to be used as metadata. If not provided, then all inserted columns (except for columns defined as id_column and content_columns) are considered metadata columns.

Here is an example of usage. A user wants to store the following data in a knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

The product column can be used as metadata to enable metadata filtering.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    metadata_columns = ['product'],
    ...

content_columns

The data inserted into the knowledge base can be classified as content, which is embedded by the embedding model and stored in the underlying vector store.

Note that source data column(s) included in content_columns cannot be used in metadata_columns, and vice versa.

This parameter is an array of strings that lists column names from the source data to be used as content and processed into embeddings. If not provided, the content column is expected by default when inserting data into the knowledge base.

Here is an example of usage. A user wants to store the following data in a knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

The notes column can be used as content.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    content_columns = ['notes'],
    ...

id_column

The ID column uniquely identifies each source data row in the knowledge base.

It is an optional parameter. If provided, this parameter is a string that contains the source data ID column name. If not provided, it is generated from the hash of the content columns.

Here is an example of usage. A user wants to store the following data in a knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

The order_id column can be used as ID.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    id_column = 'order_id',
    ...

Note that if the source data row is chunked into multiple chunks by the knowledge base (that is, to optimize the storage), then these rows in the knowledge base have the same ID value that identifies chunks from one source data row.

Available options for the ID column values

  • User-Defined ID Column:
    When users defined the id_column parameter, the values from the provided source data column are used to identify source data rows within the knowledge base.

  • User-Generated ID Column:
    When users do not have a column that uniquely identifies each row in their source data, they can generate the ID column values when inserting data into the knowledge base using functions like HASH() or ROW_NUMBER().

INSERT INTO my_kb (
    SELECT ROW_NUMBER() OVER (ORDER BY order_id) AS id, *
    FROM sample_data.orders
);
  • Default ID Column:
    If the id_column parameter is not defined, its default values are build from the hash of the content columns and follow the format: <first 16 char of md5 hash of row content>.

Example

Here is a sample knowledge base that will be used for examples in the following content.

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-abc123"
    },
    metadata_columns = ['product'],
    content_columns = ['notes'],
    id_column = 'order_id';

DESCRIBE KNOWLEDGE_BASE Syntax

Users can get details about the knowledge base using the DESCRIBE KNOWLEDGE_BASE command.

DESCRIBE KNOWLEDGE_BASE my_kb;

Here is the sample output:

+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
| NAME    | PROJECT | MODEL  | STORAGE                                | PARAMS                                                                                                                                                                                                                                       | INSERT_STARTED_AT | INSERT_FINISHED_AT | PROCESSED_ROWS | ERROR | QUERY_ID |
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
| my_kb   | mindsdb | [NULL] | my_kb_chromadb.default_collection      | {"embedding_model": {"provider": "openai", "model_name": "text-embedding-ada-002", "api_key": "sk-xxx"}, "reranking_model": {"provider": "openai", "model_name": "gpt-4o", "api_key": "sk-xxx"}, "default_vector_storage": "my_kb_chromadb"} | [NULL]            | [NULL]             | [NULL]         | [NULL]| [NULL]   |
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+

INSERT INTO Syntax

Here is the syntax for inserting data into a knowledge base:

INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders;

Upon execution, it inserts data into a knowledge base, using the embedding model to embed it into vectors before inserting into an underlying vector database.

The status of the INSERT INTO is logged in the information_schema.queries table with the timestamp when it was ran, and can be queried as follows:

SELECT *
FROM information_schema.queries;

To speed up the initial loading of a knowledge base with data, run the INSERT INTO command with the kb_no_upsert flag. It is recommended to use this flag only when the knowledge base is empty, that is, only for the initial data insertion.

INSERT INTO my_kb
SELECT *
FROM table_name
USING kb_no_upsert = true;

Handling duplicate data while inserting into the knowledge base

Knowledge bases uniquely identify data rows using an ID column, which prevents from inserting duplicate data, as follows.

  • Case 1: Inserting data into the knowledge base without the id_column defined.

    When users do not define the id_column during the creation of a knowledge base, MindsDB generates the ID for each row using a hash of the content columns, as explained here.

    Example:

    If two rows have exactly the same content in the content columns, their hash (and thus their generated ID) will be the same.

    Note that duplicate rows are skipped and not inserted.

    Since both rows in the below table have the same content, only one row will be inserted.

    nameage
    Alice25
    Alice25
  • Case 2: Inserting data into the knowledge base with the id_column defined.

    When users define the id_column during the creation of a knowledge base, then the knowledge base uses that column’s values as the row ID.

    Example:

    If the id_column has duplicate values, the knowledge base skips the duplicate row(s) during the insert.

    The second row in the below table has the same id as the first row, so only one of these rows is inserted.

    idnameage
    1Alice25
    1Bob30

Best practice

Ensure the id_column uniquely identifies each row to avoid unintentional data loss due to duplicate ID skipping.

Update Existing Data

In order to update existing data in the knowledge base, insert data with the column ID that you want to update and the updated content.

Here is an example of usage. A knowledge base stores the following data.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

A user updated Laptop Stand to Aluminum Laptop Stand.

+----------+-----------------------+------------------------+
| order_id | product               | notes                  |
+----------+-----------------------+------------------------+
| A1B      | Wireless Mouse        | Request color: black   |
| 3XZ      | Bluetooth Speaker     | Gift wrap requested    |
| Q7P      | Aluminum Laptop Stand | Prefer aluminum finish |
+----------+-----------------------+------------------------+

Go to the Complete Example section below to find out how to access this sample data.

Here is how to propagate this change into the knowledge base.

INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders
WHERE order_id = 'Q7P';

The knowledge base matches the ID value to the existing one and updates the data if required.

Insert Data using Partitions

In order to optimize the performance of data insertion into the knowledge base, users can set up partitions and threads to insert batches of data in parallel. This also enables tracking the progress of data insertion process including cancelling and resuming it if required.

Here is an example.

INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders
USING
    batch_size = 200,
    track_column = order_id,
    threads = 10,
    error = 'skip';

The parameters include the following:

  • batch_size defines the number of rows fetched per iteration to optimize data extraction from the source. It defaults to 1000.

  • threads defines threads for running partitions. Note that if the ML task queue is enabled, threads are used automatically. The available values for include a number of threads to be used, for example, threads = 10, or a boolean value that defines whether to enable threads, setting threads = true, or disable threads, setting threads = false.

  • track_column defines the column used for sorting data before partitioning.

  • error defines the error processing options. The available values include raise, used to raise errors as they come, or skip, used to subside errors. It defaults to raise if not provided.

After executing the INSERT INTO statement with the above parameters, users can view the data insertion progress by querying the information_schema.queries table.

SELECT * FROM information_schema.queries;

Users can cancel the data insertion process using the process ID from the information_schema.queries table.

SELECT query_cancel(1);

Note that canceling the query will not remove the already inserted data.

Users can resume the data insertion process using the process ID from the information_schema.queries table.

SELECT query_resume(1);

Chunking Data

Upon inserting data into the knowledge base, the data chunking is performed in order to optimize the storage and search of data.

Each chunk is identified by its chunk ID of the following format: <id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>.

Text

Users can opt for defining the chunking parameters when creating a knowledge base.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    preprocessing = {
        "text_chunking_config" : {
            "chunk_size": 2000,
            "chunk_overlap": 200
        }
    },
    ...;

The chunk_size parameter defines the size of the chunk as the number of characters. And the chunk_overlap parameter defines the number of characters that should overlap between subsequent chunks.

JSON

Users can opt for defining the chunking parameters specifically for JSON data.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    preprocessing = {
        "type": "json_chunking",
        "json_chunking_config" : {
            ...
        }
    },
    ...;

When the type of chunking is set to json_chunking, users can configure it by setting the following parameter values in the json_chunking_config parameter:

  • flatten_nested
    It is of the bool data type with the default value of True.
    It defines whether to flatten nested JSON structures.

  • include_metadata
    It is of the bool data type with the default value of True.
    It defines whether to include original metadata in chunks.

  • chunk_by_object
    It is of the bool data type with the default value of True.
    It defines whether to chunk by top-level objects (True) or create a single document (False).

  • exclude_fields
    It is of the List[str] data type with the default value of an empty list.
    It defines the list of fields to exclude from chunking.

  • include_fields
    It is of the List[str] data type with the default value of an empty list.
    It defines the list of fields to include in chunking (if empty, all fields except excluded ones are included).

  • metadata_fields
    It is of the List[str] data type with the default value of an empty list.
    It defines the list of fields to extract into metadata for filtering (can include nested fields using dot notation). If empty, all primitive fields will be extracted (top-level fields if available, otherwise all primitive fields in the flattened structure).

  • extract_all_primitives
    It is of the bool data type with the default value of False.
    It defines whether to extract all primitive values (strings, numbers, booleans) into metadata.

  • nested_delimiter
    It is of the str data type with the default value of ".".
    It defines the delimiter for flattened nested field names.

  • content_column
    It is of the str data type with the default value of "content".
    It defines the name of the content column for chunk ID generation.

Underlying Vector Store

Each knowledge base has its underlying vector store that stores data inserted into the knowledge base in the form of embeddings.

Users can query the underlying vector store as follows.

  • KB with the default ChromaDB vector store:
SELECT id, content, metadata, embeddings
FROM <kb_name>_chromadb.storage_table;
SELECT id, content, metadata, embeddings
FROM <vector_store_connection_name>.<table_name>;

Example

Here a sample knowledge base created in the previous Example section is inserted into.

INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders;

When inserting into a knowledge base where the content_columns parameter was not specified, the column storing content must be aliased AS content as below.

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    id_column = 'order_id',
    ...
INSERT INTO my_kb
SELECT order_id, notes AS content
FROM sample_data.orders;

SELECT FROM KB Syntax

Knowledge bases provide an abstraction that enables users to see the stored data.

Note that here a sample knowledge base created and inserted into in the previous Example sections is searched.

SELECT *
FROM my_kb;

Here is the sample output:

+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| id  | chunk_id             | chunk_content           | metadata                                                                                                                                                                                            | distance           | relevance          |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black    | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0}        | 0.5743341242061104 | 0.5093188026135379 |
| Q7P | Q7P_notes:1of1:0to22 | Prefer aluminum finish  | {"chunk_index":0,"content_column":"notes","end_char":22,"original_doc_id":"Q7P_notes","original_row_id":"Q7P","product":"Aluminum Laptop Stand","source":"TextChunkingPreprocessor","start_char":0} | 0.7744703514692067 | 0.2502580835880018 |
| 3XZ | 3XZ_notes:1of1:0to19 | Gift wrap requested     | {"chunk_index":0,"content_column":"notes","end_char":19,"original_doc_id":"3XZ_notes","original_row_id":"3XZ","product":"Bluetooth Speaker","source":"TextChunkingPreprocessor","start_char":0}     | 0.8010851611432231 | 0.2500003885558766 |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+

Data Stored in Knowledge Base

The following columns are stored in the knowledge base.

id It stores values from the column defined in the id_column parameter when creating the knowledge base. These are the source data IDs.

chunk_id Knowledge bases chunk the inserted data in order to fit the defined chunk size. If the chunking is performed, the following chunk ID format is used: <id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>.

chunk_content It stores values from the column(s) defined in the content_columns parameter when creating the knowledge base.

metadata It stores the general metadata and the metadata defined in the metadata_columns parameter when creating the knowledge base.

distance It stores the calculated distance between the chunk’s content and the search phrase.

relevance It stores the calculated relevance of the chunk as compared to the search phrase.

Note that the calculation method of relevance_threshold differs as follows:

  • When the ranking model is provided, the default relevance_threshold is 0, unless defined otherwise in the WHERE clause.
  • When the reranking model is not provided and the relevance_threshold is not defined in the query, then no relevance filtering is applied and the output includes all rows matched based on the similarity and metadata search.
  • When the reranking model is not provided but the relevance_threshold is defined in the query, then the relevance is calculated based on the distance column (1/(1+ distance)) and the relevance_threshold value is compared with this relevance value to filter the output.

Users can query a knowledge base using semantic search by providing the search phrase (called content) to be searched for.

SELECT *
FROM my_kb
WHERE content = 'color'

Here is the output:

+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| id  | chunk_id             | chunk_content           | metadata                                                                                                                                                                                            | distance           | relevance          |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black    | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0}        | 0.5743341242061104 | 0.5093188026135379 |
| Q7P | Q7P_notes:1of1:0to22 | Prefer aluminum finish  | {"chunk_index":0,"content_column":"notes","end_char":22,"original_doc_id":"Q7P_notes","original_row_id":"Q7P","product":"Aluminum Laptop Stand","source":"TextChunkingPreprocessor","start_char":0} | 0.7744703514692067 | 0.2502580835880018 |
| 3XZ | 3XZ_notes:1of1:0to19 | Gift wrap requested     | {"chunk_index":0,"content_column":"notes","end_char":19,"original_doc_id":"3XZ_notes","original_row_id":"3XZ","product":"Bluetooth Speaker","source":"TextChunkingPreprocessor","start_char":0}     | 0.8010851611432231 | 0.2500003885558766 |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+

When querying a knowledge base, the default values include the following:

  • relevance_threshold
    If not provided, its default value is 0, ensuring there is no filtering of rows based on their relevance.

  • LIMIT
    If not provided, its default value is 10, returning a maximum of 10 rows.

Note that when specifying both relevance_threshold and LIMIT as follows:

SELECT *
FROM my_kb
WHERE content = 'color'
AND relevance_threshold = 0.5
LIMIT 20;

The query extracts 20 rows (as defined in the LIMIT clause) that match the defined content. Next, these set of rows is filtered out to match the defined relevance_threshold.

Users can limit the relevance_threshold in order to get only the most relevant results.

SELECT *
FROM my_kb
WHERE content = 'color'
AND relevance_threshold = 0.5;

Here is the output:

+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| id  | chunk_id             | chunk_content          | metadata                                                                                                                                                                                     | distance           | relevance          |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black   | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0} | 0.5743341242061104 | 0.5103766499957533 |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+

By providing the relevance_threshold filter, the output is limited to only data with relevance score of the provided value or higher. The available value of relevance_threshold are between 0 and 1, and its default value is 0.

Users can limit the number of rows returned.

SELECT *
FROM my_kb
WHERE content = 'color'
LIMIT 2;

Here is the output:

+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| id  | chunk_id             | chunk_content           | metadata                                                                                                                                                                                            | distance           | relevance          |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black    | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0}        | 0.5743341242061104 | 0.5093188026135379 |
| Q7P | Q7P_notes:1of1:0to22 | Prefer aluminum finish  | {"chunk_index":0,"content_column":"notes","end_char":22,"original_doc_id":"Q7P_notes","original_row_id":"Q7P","product":"Aluminum Laptop Stand","source":"TextChunkingPreprocessor","start_char":0} | 0.7744703514692067 | 0.2502580835880018 |
+-----+----------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------------------+

Metadata Filtering

Besides semantic search features, knowledge bases enable users to filter the result set by the defined metadata.

SELECT *
FROM my_kb
WHERE product = 'Wireless Mouse';

Here is the output:

+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+
| id  | chunk_id             | chunk_content          | metadata                                                                                                                                                                                     | relevance | distance |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+
| A1B | A1B_notes:1of1:0to20 | Request color: black   | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0} | [NULL]    | [NULL]   |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+

Note that when searching by metadata alone, the relevance column values are not calculated.

Users can do both, filter by metadata and search by content.

SELECT *
FROM my_kb
WHERE product = 'Wireless Mouse'
AND content = 'color'
AND relevance_threshold = 0.5;

Here is the output:

+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------+
| id  | chunk_id             | chunk_content          | metadata                                                                                                                                                                                     | distance           | relevance         |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------+
| A1B | A1B_notes:1of1:0to20 | Request color: black   | {"chunk_index":0,"content_column":"notes","end_char":20,"original_doc_id":"A1B_notes","original_row_id":"A1B","product":"Wireless Mouse","source":"TextChunkingPreprocessor","start_char":0} | 0.5743341242061104 | 0.504396172197583 |
+-----+----------------------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------+

EVALUATE KNOWLEDGE_BASE Syntax

With the EVALUATE KNOWLEDGE_BASE command, users can evaluate the relevancy and accuracy of the documents and data returned by the knowledge base.

Below is the complete syntax that includes both required and optional parameters.

EVALUATE KNOWLEDGE_BASE my_kb
USING
    test_table = my_datasource.my_test_table,
    version = 'doc_id',
    generate_data = {
        'from_sql': 'SELECT content FROM my_datasource.my_table',
        'count': 100
    }, 
    evaluate = false,
    llm = {
        'provider': 'openai',
        'api_key':'sk-xxx',
        'model':'gpt-4'
    },
    save_to = my_datasource.my_result_table; 

test_table

This is a required parameter that stores the name of the table from one of the data sources connected to MindsDB. For example, test_table = my_datasource.my_test_table defines a table named my_test_table from a data source named my_datasource.

This test table stores test data commonly in form of questions and answers. Its content depends on the version parameter defined below.

Users can provide their own test data or have the test data generated by the EVALUATE KNOWLEDGE_BASE command, which is performed when setting the generate_data parameter defined below.

version

This is an optional parameter that defines the version of the evaluator. If not defined, its default value is doc_id.

  • version = 'doc_id' The evaluator checks whether the document ID returned by the knowledge base matched the expected document ID as defined in the test table.

  • version = 'llm_relevancy' The evaluator uses a language model to rank and evaluate responses from the knowledge base.

generate_data

This is an optional parameter used to configure the test data generation, which is saved into the table defined in the test_table parameter. If not defined, its default value is false, meaning that no test data is generated.

Available values are as follows:

  • A dictionary containing the following values:

    • from_sql defines the SQL query that fetches the test data. For example, 'from_sql': 'SELECT content FROM my_datasource.my_table'. If not defined, it fetches test data from the knowledge base on which the EVALUATE command is executed: SELECT chunk_content, id FROM my_kb.
    • count defines the size of the test dataset. For example, 'count': 100. Its default value is 20.
  • A value of true, such as generate_data = true, which implies that default values for from_sql and count will be used.

evaluate

This is an optional parameter that defines whether to evaluate the knowledge base. If not defined, its default value is true.

Users can opt for setting it to false, evaluate = false, in order to generate test data into the test table without running the evaluator.

llm

This is an optional parameter that defines a language model to be used for evaluations, if version is set to llm_relevancy.

If not defined, its default value is the reranking_model defined with the knowledge base.

Users can define it with the EVALUATE KNOWLEDGE_BASE command in the same manner.

EVALUATE KNOWLEDGE_BASE my_kb
USING
    ...
    llm = {
        "provider": "azure_openai",
        "model_name" : "gpt-4o",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01",
        "method": "multi-class"
    },
    ...

save_to

This is an optional parameter that stores the name of the table from one of the data sources connected to MindsDB. For example, save_to = my_datasource.my_result_table defines a table named my_result_table from the data source named my_datasource. If not defined, the results are not saved into a table.

This table is used to save the evaluation results.

By default, evaluation results are returned after executing the EVALUATE KNOWLEDGE_BASE statement.

Evaluation Results

When using version = 'doc_id', the following columns are included in the evaluation results:

  • total stores the total number of questions.
  • total_found stores the number of questions to which the knowledge bases provided correct answers.
  • retrieved_in_top_10 stores the number of top 10 questions to which the knowledge bases provided correct answers.
  • cumulative_recall stores data that can be used to create a chart.
  • avg_query_time stores the execution time of a search query of the knowledge base.
  • name stores the knowledge base name.
  • created_at stores the timestamp when the evaluation was created.

When using version = 'llm_relevancy', the following columns are included in the evaluation results:

  • avg_relevancy stores the average relevancy.
  • avg_relevance_score_by_k stores the average relevancy at k.
  • avg_first_relevant_position stores the average first relevant position.
  • mean_mrr stores the Mean Reciprocal Rank (MRR).
  • hit_at_k stores the Hit@k value.
  • bin_precision_at_k stores the Binary Precision@k.
  • avg_entropy stores the average relevance score entropy.
  • avg_ndcg stores the average nDCG.
  • avg_query_time stores the execution time of a search query of the knowledge base.
  • name stores the knowledge base name.
  • created_at stores the timestamp when the evaluation was created.

CREATE INDEX ON KNOWLEDGE_BASE Syntax

Users can create index on the knowledge base to speed up the search operations.

CREATE INDEX ON KNOWLEDGE_BASE my_kb;

Note that this feature works only when PGVector is used as the storage of the knowledge base, as ChromaDB provides the index features by default.

Upon executing this statement, an index is created on the knowledge base’s underlying vector store. This is essentially a database index created on the vector database.

Note that having an index on the knowledge base may reduce the speed of the insert operations. Therefore, it is recommended to insert bulk data into the knowledge base before creating an index. The index improves performance of querying the knowledge base, while it may slow down subsequent data inserts.

JOIN Syntax

Knowledge bases can be used in the standard SQL JOIN statements.

SELECT t.order_id, t.product, t.notes, kb.chunk_content, kb.relevance
FROM local_postgres.orders AS t
JOIN my_kb AS kb
ON t.order_id = kb.id
WHERE t.order_id = 'A1B'
AND kb.content = 'color'
AND kb.product = 'Wireless Mouse';

Here is the output:

+----------+------------------+------------------------+------------------------+--------------------+
| order_id | product          | notes                  | chunk_content          | relevance          |
+----------+------------------+------------------------+------------------------+--------------------+
| A1B      | Wireless Mouse   | Request color: black   | Request color: black   | 0.5106591666649376 |
+----------+------------------+------------------------+------------------------+--------------------+

DELETE FROM Syntax

Here is the syntax for deleting from a knowledge base:

DELETE FROM my_kb
WHERE id = 'A1B';

Upon execution, it identifies matching records based on the user-defined condition and removes all associated data (metadata, content, chunks, embeddings) for matching records from the KB’s storage.

DROP KNOWLEDGE_BASE Syntax

Here is the syntax for deleting a knowledge base:

DROP KNOWLEDGE_BASE my_kb;

Upon execution, it removes the knowledge base with its content.

Examples

Sales Data

Here is the data that will be inserted into the knowledge base.

+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+

You can access this sample data as below:

CREATE DATABASE sample_data
WITH ENGINE = 'postgres',
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "samples.mindsdb.com",
    "port": "5432",
    "database": "demo",
    "schema": "demo_data"
};

SELECT * FROM sample_data.orders;

Here is how to create a knowledge base specifically for the data.

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-abc123"
    },
    metadata_columns = ['product'],
    content_columns = ['notes'],
    id_column = 'order_id';

Here is how to insert the data.

INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders;

Here is how to query the knowledge base.

SELECT *
FROM my_kb
WHERE product = 'Wireless Mouse'
AND content = 'color'
AND relevance_threshold = 0.5;

Financial Data

You can access the sample data as below:

CREATE DATABASE sample_data
WITH ENGINE = 'postgres',
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "samples.mindsdb.com",
    "port": "5432",
    "database": "demo",
    "schema": "demo_data"
};

SELECT * FROM sample_data.financial_headlines;

Here is how to create a knowledge base specifically for the data.

CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-xxx"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-xxx"
    },
    metadata_columns = ['sentiment_labelled'],
    content_columns = ['headline'];

Here is how to insert the data.

INSERT INTO my_kb
SELECT *
FROM sample_data.financial_headlines
USING
    batch_size = 500,
    threads = 10;

Here is how to query the knowledge base.

  • Query without defined LIMIT
SELECT *
FROM my_kb
WHERE content = 'investors';

This query returns 10 rows, as the default LIMIT is set to 10.

  • Query with defined LIMIT
SELECT *
FROM my_kb
WHERE content = 'investors'
LIMIT 20;

This query returns 20 rows, as the user-defined LIMIT is set to 20.

  • Query with defined LIMIT and relevance_threshold
SELECT *
FROM my_kb
WHERE content = 'investors'
AND relevance_threshold = 0.8
LIMIT 20;

This query may return 20 or less rows, depending on whether the relevance scores of the rows match the user-defined condition.