Knowledge Base
A knowledge base is an advanced system that organizes information based on semantic meaning rather than simple keyword matching. It integrates embedding models, reranking models, and vector stores to enable context-aware data retrieval.
By performing semantic reasoning across multiple data points, a knowledge base delivers deeper insights and more accurate responses, making it a powerful tool for intelligent data access.
How Knowledge Bases Work
Before diving into the syntax, here is a quick walkthrough showing how knowledge bases work in MindsDB.
We start by creating a knowledge base and inserting data. Next we can run semantic search queries with metadata filtering.
Create a knowledge base
Use the create() function to create a knowledge base, specifying all its components.
Insert data into the knowledge base
In this example, we use a simple dataset containing customer notes for product orders which will be inserted into the knowledge base.
Use the insert_query() function to ingest data into the knowledge base from a query.
Run semantic search on the knowledge base
Query the knowledge base using semantic search.
This query returns:
Get the most relevant search results
Query the knowledge base using semantic search and define the relevance parameter to receive only the best matching data for your use case.
This query returns:
Filter results by metadata
Add metadata filtering to focus your search.
This query returns:
The following sections explain the syntax and other features of knowledge bases.
create() Function
Here is the syntax for creating a knowledge base:
Upon execution, it registers my_kb and associates the specified models and storage. my_kb is a unique identifier of the knowledge base within MindsDB.
As MindsDB stores objects, such as models or knowledge bases, inside projects, you can create a knowledge base inside a custom project.
Supported LLMs
Below is the list of all language models supported for the embedding_model and reranking_model parameters.
provider = 'openai'
When choosing openai as the model provider, users should define the following model parameters.
- model_name stores the name of the OpenAI model to be used.
- api_key stores the OpenAI API key.
Learn more about the OpenAI integration with MindsDB here.
provider = 'azure_openai'
When choosing azure_openai as the model provider, users should define the following model parameters.
- model_name stores the name of the OpenAI model to be used.
- api_key stores the OpenAI API key.
- base_url stores the base URL of the Azure instance.
- api_version stores the version of the Azure instance.
provider = 'bedrock'
When choosing bedrock as the model provider, users should define the following model parameters.
- model_name stores the name of the model available via Amazon Bedrock.
- aws_access_key_id stores a unique identifier associated with your AWS account, used to identify the user or application making requests to AWS.
- aws_region_name stores the name of the AWS region you want to send your requests to (e.g., "us-west-2").
- aws_secret_access_key stores the secret key associated with your AWS access key ID. It is used to sign your requests securely.
- aws_session_token stores a temporary token used for short-term security credentials when using AWS Identity and Access Management (IAM) roles or temporary credentials.
provider = 'snowflake'
When choosing snowflake as the model provider, users should choose one of the available models from Snowflake Cortex AI and define the following model parameters.
- model_name stores the name of the model available via Snowflake Cortex AI.
- api_key stores the Snowflake Cortex AI API key.
- snowflake_account_id stores the Snowflake account ID.
embedding_model
The embedding model is a required component of the knowledge base. It stores specifications of the embedding model to be used.
Users can define the embedding model choosing one of the following options.
Option 1. Use the embedding_model parameter to define the specification.
Option 2. Define the default embedding model in the MindsDB configuration file.
You can define the default models in the Settings of the MindsDB Editor GUI.
Note that if you define default_embedding_model in the configuration file, you do not need to provide the embedding_model parameter when creating a knowledge base. If you provide both, the values from the embedding_model parameter are used.
The embedding model specification includes:
- provider: A required parameter. It defines the model provider as listed in supported LLMs.
- model_name: A required parameter. It defines the embedding model name as specified by the provider.
- api_key: The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this api_key parameter, or in the OPENAI_API_KEY environment variable for "provider": "openai" and the AZURE_OPENAI_API_KEY environment variable for "provider": "azure_openai".
- base_url: An optional parameter, which defaults to https://api.openai.com/v1/. It is required when using the azure_openai provider. It is the root URL used to send API requests.
- api_version: An optional parameter, except when using the azure_openai provider, where it is required. It defines the API version.
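The rules above can be sketched as a small check over a plain dictionary. This is only an illustration of which fields are required and when: `check_embedding_spec` is a hypothetical helper, not part of MindsDB, and the model name is an example value chosen for the sketch. Applying the default `base_url` only for the `openai` provider is also an assumption.

```python
# Hypothetical validator for an embedding model specification.
# It mirrors the documented rules only; it is not MindsDB's own code.
REQUIRED_ALWAYS = ("provider", "model_name")
REQUIRED_FOR_AZURE = ("base_url", "api_version")
DEFAULT_BASE_URL = "https://api.openai.com/v1/"


def check_embedding_spec(spec: dict) -> dict:
    """Return a copy of spec with defaults applied, or raise ValueError."""
    missing = [key for key in REQUIRED_ALWAYS if not spec.get(key)]
    if spec.get("provider") == "azure_openai":
        # base_url and api_version become required for Azure.
        missing += [key for key in REQUIRED_FOR_AZURE if not spec.get(key)]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))

    checked = dict(spec)
    if checked["provider"] == "openai":
        # Assumption: the documented default is applied for the openai provider.
        checked.setdefault("base_url", DEFAULT_BASE_URL)
    return checked


openai_spec = check_embedding_spec({
    "provider": "openai",
    "model_name": "text-embedding-3-small",  # example value, an assumption
    # api_key omitted: it may instead come from OPENAI_API_KEY.
})
print(openai_spec["base_url"])
```

An `azure_openai` specification without `base_url` or `api_version` would raise a `ValueError` naming the missing fields.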
reranking_model
The reranking model is an optional component of the knowledge base. It stores specifications of the reranking model to be used.
Users can disable reranking features of knowledge bases by setting this parameter to false.
Users can enable reranking features of knowledge bases by defining the reranking model using one of the following options.
Option 1. Use the reranking_model parameter to define the specification.
Option 2. Define the default reranking model in the MindsDB configuration file.
You can define the default models in the Settings of the MindsDB Editor GUI.
Note that if you define default_reranking_model in the configuration file, you do not need to provide the reranking_model parameter when creating a knowledge base. If you provide both, the values from the reranking_model parameter are used.
The reranking model specification includes:
- provider: A required parameter. It defines the model provider. Currently, the supported providers include OpenAI (openai) and OpenAI via Azure (azure_openai).
- model_name: A required parameter. It defines the reranking model name as specified by the provider. Users can choose one of the OpenAI chat models.
- api_key: The API key is required to access the reranking model assigned to a knowledge base. Users can provide it either in this api_key parameter, or in the OPENAI_API_KEY environment variable for "provider": "openai" and the AZURE_OPENAI_API_KEY environment variable for "provider": "azure_openai".
- base_url: An optional parameter, which defaults to https://api.openai.com/v1/. It is required when using the azure_openai provider. It is the root URL used to send API requests.
- api_version: An optional parameter, except when using the azure_openai provider, where it is required. It defines the API version.
- method: An optional parameter. It defines the method used to calculate the relevance of the output rows. The available options include multi-class and binary. It defaults to multi-class.
Reranking Method
The multi-class reranking method classifies each document chunk (that meets any specified metadata filtering conditions) into one of four relevance classes:
- Not relevant with class weight of 0.25.
- Slightly relevant with class weight of 0.5.
- Moderately relevant with class weight of 0.75.
- Highly relevant with class weight of 1.
The overall relevance_score of a document is calculated as the sum of each chunk’s class weight multiplied by its class probability (from model logprob output).
The binary reranking method simplifies classification by determining whether a document is relevant or not, without intermediate relevance levels. With this method, the overall relevance_score of a document is calculated based on the model log probability.
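The multi-class calculation can be illustrated as a weighted sum. The sketch below assumes the model returns one log probability per class and that each is converted to a probability with `exp`. The class labels and the function name are hypothetical, and this is not MindsDB's actual implementation.

```python
import math

# Class weights as listed above; the dictionary keys are hypothetical labels.
CLASS_WEIGHTS = {
    "not_relevant": 0.25,
    "slightly_relevant": 0.5,
    "moderately_relevant": 0.75,
    "highly_relevant": 1.0,
}


def multi_class_relevance(class_logprobs: dict) -> float:
    """Sum of class weight times class probability for one chunk."""
    return sum(
        CLASS_WEIGHTS[label] * math.exp(logprob)  # logprob -> probability
        for label, logprob in class_logprobs.items()
    )


# Made-up log probabilities corresponding to probabilities 0.1, 0.2, 0.3, 0.4.
example = {
    "not_relevant": math.log(0.1),
    "slightly_relevant": math.log(0.2),
    "moderately_relevant": math.log(0.3),
    "highly_relevant": math.log(0.4),
}
score = multi_class_relevance(example)
print(round(score, 3))
```

With these made-up probabilities the weighted sum is 0.025 + 0.1 + 0.225 + 0.4, which is 0.75.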
storage
The vector store is a required component of the knowledge base. It stores data in the form of embeddings.
It is optional for users to provide the storage parameter. If not provided, the default ChromaDB is created when creating a knowledge base.
The available options are PGVector and ChromaDB.
It is recommended to use PGVector version 0.8.0 or higher for better performance.
If the storage parameter is not provided, the system creates the default ChromaDB vector database called <kb_name>_chromadb with the default table called default_collection that stores the embedded data. This default ChromaDB vector database is stored in MindsDB’s storage.
To use your own vector database as storage, you must connect it to MindsDB beforehand.
Here is an example for PGVector.
Note that you do not need to create the storage_table beforehand, as it is created when creating a knowledge base.
metadata_columns
The data inserted into the knowledge base can be classified as metadata, which enables users to filter the search results using defined data fields.
Note that source data column(s) included in metadata_columns cannot be used in content_columns, and vice versa.
This parameter is an array of strings that lists column names from the source data to be used as metadata. If not provided, then all inserted columns (except for columns defined as id_column and content_columns) are considered metadata columns.
Here is an example of usage. A user wants to store the following data in a knowledge base.
Go to the Complete Example section below to find out how to access this sample data.
The product column can be used as metadata to enable metadata filtering.
content_columns
The data inserted into the knowledge base can be classified as content, which is embedded by the embedding model and stored in the underlying vector store.
Note that source data column(s) included in content_columns cannot be used in metadata_columns, and vice versa.
This parameter is an array of strings that lists column names from the source data to be used as content and processed into embeddings. If not provided, the content column is expected by default when inserting data into the knowledge base.
Here is an example of usage. A user wants to store the following data in a knowledge base.
Go to the Complete Example section below to find out how to access this sample data.
The notes column can be used as content.
id_column
The ID column uniquely identifies each source data row in the knowledge base.
It is an optional parameter. If provided, this parameter is a string that contains the source data ID column name. If not provided, it is generated from the hash of the content columns.
Here is an example of usage. A user wants to store the following data in a knowledge base.
Go to the Complete Example section below to find out how to access this sample data.
The order_id column can be used as ID.
Note that if the source data row is chunked into multiple chunks by the knowledge base (that is, to optimize the storage), then these rows in the knowledge base have the same ID value that identifies chunks from one source data row.
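To make the roles of id_column, content_columns, and metadata_columns concrete, here is a minimal sketch of how one source row could be split according to the rules described in these sections. The helper `split_row` is hypothetical and the row values are made up for illustration. It does not show how MindsDB processes rows internally.

```python
def split_row(row, content_columns, id_column=None, metadata_columns=None):
    """Split one source row into ID, content, and metadata parts."""
    overlap = set(content_columns) & set(metadata_columns or [])
    if overlap:
        # A column cannot be both content and metadata.
        raise ValueError(f"columns used as both content and metadata: {sorted(overlap)}")
    if metadata_columns is None:
        # Default: every column that is neither the ID nor content.
        metadata_columns = [
            col for col in row if col != id_column and col not in content_columns
        ]
    return {
        "id": row[id_column] if id_column else None,
        "content": {col: row[col] for col in content_columns},
        "metadata": {col: row[col] for col in metadata_columns},
    }


# Made-up sample row using the column names from the examples above.
row = {"order_id": 1, "product": "Laptop Stand", "notes": "Request color: black"}
parts = split_row(row, content_columns=["notes"], id_column="order_id")
print(parts)
```

Here `product` ends up as metadata without being listed, because it is neither the ID column nor a content column.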
Available options for the ID column values
- User-Defined ID Column: When users define the id_column parameter, the values from the provided source data column are used to identify source data rows within the knowledge base.
- User-Generated ID Column: When users do not have a column that uniquely identifies each row in their source data, they can generate the ID column values when inserting data into the knowledge base using functions like HASH() or ROW_NUMBER().
- Default ID Column: If the id_column parameter is not defined, its default values are built from the hash of the content columns and follow the format: <first 16 char of md5 hash of row content>.
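The default ID format can be approximated with Python's standard `hashlib`. How the content columns are serialized before hashing is not specified here, so the space-joined string below is an assumption, and `default_row_id` is a hypothetical name.

```python
import hashlib


def default_row_id(content_values) -> str:
    """First 16 hex characters of the MD5 hash of the row content."""
    # Assumption: content values are joined with a space before hashing.
    joined = " ".join(str(value) for value in content_values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()[:16]


id_a = default_row_id(["Alice", 25])
id_b = default_row_id(["Alice", 25])
id_c = default_row_id(["Bob", 30])
print(id_a, id_a == id_b, id_a == id_c)
```

Identical content always produces the same ID, which is why rows with the same content are treated as duplicates.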
list() and get() Functions
Users can get details about the knowledge base using the get() function.
Users can list all available knowledge bases using the list() function.
insert() Function
Here is the syntax for inserting data into a knowledge base:
- Inserting raw data:
- Inserting data from data sources connected to MindsDB:
- Inserting data from files uploaded to MindsDB:
- Inserting data from webpages:
Where:
- urls: Base URLs to crawl.
- crawl_depth: Depth for recursive crawling. Default is 1.
- filters: Regex patterns to include.
- limit: Max number of pages.
Upon execution, it inserts data into a knowledge base, using the embedding model to embed it into vectors before inserting into an underlying vector database.
The status of the insert operations is logged in the information_schema.queries table with the timestamp when it was run.
Handling duplicate data while inserting into the knowledge base
Knowledge bases uniquely identify data rows using an ID column, which prevents duplicate data from being inserted, as follows.
- Case 1: Inserting data into the knowledge base without the id_column defined.
  When users do not define the id_column during the creation of a knowledge base, MindsDB generates the ID for each row using a hash of the content columns, as explained here.
  Example: If two rows have exactly the same content in the content columns, their hash (and thus their generated ID) will be the same. Note that duplicate rows are skipped and not inserted. Since both rows in the table below have the same content, only one row will be inserted.

  | name | age |
  | --- | --- |
  | Alice | 25 |
  | Alice | 25 |

- Case 2: Inserting data into the knowledge base with the id_column defined.
  When users define the id_column during the creation of a knowledge base, the knowledge base uses that column’s values as the row ID.
  Example: If the id_column has duplicate values, the knowledge base skips the duplicate row(s) during the insert. The second row in the table below has the same id as the first row, so only one of these rows is inserted.

  | id | name | age |
  | --- | --- | --- |
  | 1 | Alice | 25 |
  | 1 | Bob | 30 |
Best practice
Ensure the id_column uniquely identifies each row to avoid unintentional data loss due to duplicate ID skipping.
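One way to follow this practice is to check the source rows for repeated ID values before inserting them. The sketch below is a plain-Python pre-check on rows held in memory. `duplicate_ids` is a hypothetical helper and is independent of MindsDB.

```python
from collections import Counter


def duplicate_ids(rows, id_column):
    """Return ID values that appear more than once, in first-seen order."""
    counts = Counter(row[id_column] for row in rows)
    return [value for value, count in counts.items() if count > 1]


# The rows from Case 2 above: both share id 1.
rows = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 1, "name": "Bob", "age": 30},
]
repeated = duplicate_ids(rows, "id")
if repeated:
    print("Duplicate IDs, some rows would be skipped:", repeated)
```

If the check reports duplicates, choosing a different ID column or generating IDs at insert time avoids losing rows.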
Update Existing Data
To update existing data in the knowledge base, insert a row that carries the ID value of the row you want to update together with the updated content.
Here is an example of usage. A knowledge base stores the following data.
A user updated Laptop Stand to Aluminum Laptop Stand.
Go to the Complete Example section below to find out how to access this sample data.
Here is how to propagate this change into the knowledge base.
The knowledge base matches the ID value to the existing one and updates the data if required.
Insert Data using Partitions
To optimize the performance of data insertion into the knowledge base, users can set up partitions and threads to insert batches of data in parallel. This also enables tracking the progress of the data insertion process, including cancelling and resuming it if required.
Here is an example.
The parameters include the following:
- batch_size defines the number of rows fetched per iteration to optimize data extraction from the source. It defaults to 1000.
- threads defines threads for running partitions. Note that if the ML task queue is enabled, threads are used automatically. The available values for threads are:
  - a number of threads to be used, for example, threads = 10,
  - a boolean value that defines whether to enable threads, setting threads = true, or disable threads, setting threads = false.
- track_column defines the column used for sorting data before partitioning.
- error defines the error processing options. The available values include raise, used to raise errors as they come, or skip, used to suppress errors. It defaults to raise if not provided.
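The interplay of track_column and batch_size can be pictured as sorting followed by slicing. The generator below is only a conceptual sketch on in-memory rows. `partition_rows` is a hypothetical name, and the real partitioning happens inside MindsDB against the data source.

```python
def partition_rows(rows, track_column, batch_size=1000):
    """Yield batches of rows, sorted by track_column, of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    ordered = sorted(rows, key=lambda row: row[track_column])
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Made-up rows, deliberately out of order.
rows = [{"id": n} for n in (5, 3, 1, 4, 2)]
batches = list(partition_rows(rows, track_column="id", batch_size=2))
for number, batch in enumerate(batches, start=1):
    print(number, [row["id"] for row in batch])
```

Sorting by a stable column first is what makes it possible to tell which batches were already processed when an insert is resumed.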
After executing the INSERT INTO statement with the above parameters, users can view the data insertion progress by querying the information_schema.queries table.
Users can cancel the data insertion process using the process ID from the information_schema.queries table.
If you want to cancel the data insertion process, look up the process ID value from the information_schema.queries table and pass it as an argument to the query_cancel() function. Note that canceling the query will not remove the already inserted data.
Users can resume the data insertion process using the process ID from the information_schema.queries table.
If you want to resume the data insertion process (which may have been interrupted by an error or cancelled by a user), look up the process ID value from the information_schema.queries table and pass it as an argument to the query_resume() function. Note that resuming the query will not remove the already inserted data and will start appending the remaining data.
Chunking Data
When data is inserted into the knowledge base, it is chunked in order to optimize the storage and search of data.
Each chunk is identified by its chunk ID of the following format: <id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>.
Text
Users can optionally define the chunking parameters when creating a knowledge base.
The chunk_size parameter defines the size of the chunk as the number of characters. The chunk_overlap parameter defines the number of characters that should overlap between subsequent chunks.
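As an illustration of how chunk_size and chunk_overlap interact, the sketch below splits a string into character windows and labels each with an ID in the chunk ID format given above. It assumes 1-based chunk numbers and zero-based, end-exclusive character offsets, which this page does not specify. `chunk_text` is a hypothetical helper, not MindsDB's chunker.

```python
def chunk_text(text, row_id, chunk_size, chunk_overlap):
    """Return (chunk_id, chunk_content) pairs for fixed-size overlapping chunks."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    spans = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        spans.append((start, end))
        if end == len(text):
            break
        start += step

    total = len(spans)
    return [
        (f"{row_id}:{number}of{total}:{begin}to{end}", text[begin:end])
        for number, (begin, end) in enumerate(spans, start=1)
    ]


chunks = chunk_text("abcdefghij", row_id="A1", chunk_size=4, chunk_overlap=1)
for chunk_id, content in chunks:
    print(chunk_id, content)
```

Each chunk starts chunk_size minus chunk_overlap characters after the previous one, so neighbouring chunks share exactly chunk_overlap characters.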
JSON
Users can optionally define the chunking parameters specifically for JSON data.
When the type of chunking is set to json_chunking, users can configure it by setting the following parameter values in the json_chunking_config parameter:
- flatten_nested: It is of the bool data type with the default value of True. It defines whether to flatten nested JSON structures.
- include_metadata: It is of the bool data type with the default value of True. It defines whether to include original metadata in chunks.
- chunk_by_object: It is of the bool data type with the default value of True. It defines whether to chunk by top-level objects (True) or create a single document (False).
- exclude_fields: It is of the List[str] data type with the default value of an empty list. It defines the list of fields to exclude from chunking.
- include_fields: It is of the List[str] data type with the default value of an empty list. It defines the list of fields to include in chunking (if empty, all fields except excluded ones are included).
- metadata_fields: It is of the List[str] data type with the default value of an empty list. It defines the list of fields to extract into metadata for filtering (can include nested fields using dot notation). If empty, all primitive fields will be extracted (top-level fields if available, otherwise all primitive fields in the flattened structure).
- extract_all_primitives: It is of the bool data type with the default value of False. It defines whether to extract all primitive values (strings, numbers, booleans) into metadata.
- nested_delimiter: It is of the str data type with the default value of ".". It defines the delimiter for flattened nested field names.
- content_column: It is of the str data type with the default value of "content". It defines the name of the content column for chunk ID generation.
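The defaults listed above can be written out as a dictionary, and the effect of flatten_nested with nested_delimiter can be sketched with a small recursive function. The `flatten` helper and the sample document are hypothetical. The sketch handles nested objects only and leaves lists as values, and it is not MindsDB's JSON chunker.

```python
# The documented defaults, collected in one place.
json_chunking_config = {
    "flatten_nested": True,
    "include_metadata": True,
    "chunk_by_object": True,
    "exclude_fields": [],
    "include_fields": [],
    "metadata_fields": [],
    "extract_all_primitives": False,
    "nested_delimiter": ".",
    "content_column": "content",
}


def flatten(obj: dict, delimiter: str = ".", prefix: str = "") -> dict:
    """Flatten nested dictionaries into one level with delimited key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{delimiter}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, delimiter, name))
        else:
            flat[name] = value
    return flat


# Made-up document for illustration.
document = {"order": {"id": 1, "customer": {"name": "Alice"}}, "status": "open"}
flat_document = flatten(document, json_chunking_config["nested_delimiter"])
print(flat_document)
```

Flattened names such as `order.customer.name` are the kind of dot-notation paths that metadata_fields can refer to.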
Underlying Vector Store
Each knowledge base has its underlying vector store that stores data inserted into the knowledge base in the form of embeddings.
Users can query the underlying vector store as follows.
- KB with the default ChromaDB vector store:
find() Function
Knowledge bases provide an abstraction that enables users to see the stored data.
Note that this example searches the sample knowledge base that was created and populated in the previous example sections.
Here is the sample output:
Data Stored in Knowledge Base
The following columns are stored in the knowledge base.
- id: It stores values from the column defined in the id_column parameter when creating the knowledge base. These are the source data IDs.
- chunk_id: Knowledge bases chunk the inserted data in order to fit the defined chunk size. If the chunking is performed, the following chunk ID format is used: <id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>.
- chunk_content: It stores values from the column(s) defined in the content_columns parameter when creating the knowledge base.
- metadata: It stores the general metadata and the metadata defined in the metadata_columns parameter when creating the knowledge base.
- distance: It stores the calculated distance between the chunk’s content and the search phrase.
- relevance: It stores the calculated relevance of the chunk as compared to the search phrase. Its values are between 0 and 1.
Note that the calculation method of relevance differs as follows:
- When the reranking model is provided, the default relevance is equal to or greater than 0, unless defined otherwise in the WHERE clause.
- When the reranking model is not provided and the relevance is not defined in the query, then no relevance filtering is applied and the output includes all rows matched based on the similarity and metadata search.
- When the reranking model is not provided but the relevance is defined in the query, then the relevance is calculated based on the distance column (1/(1 + distance)) and the relevance value given in the query is compared with this calculated relevance to filter the output.
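The distance-based case can be written out directly from the formula 1/(1 + distance). The sketch below uses made-up distances and assumes the query asks for relevance at or above a threshold. The function names are hypothetical.

```python
def relevance_from_distance(distance: float) -> float:
    """Relevance derived from distance when no reranking model is used."""
    return 1.0 / (1.0 + distance)


def filter_by_relevance(rows, threshold):
    """Keep rows whose distance-based relevance is at least threshold."""
    kept = []
    for row in rows:
        relevance = relevance_from_distance(row["distance"])
        if relevance >= threshold:
            kept.append({**row, "relevance": relevance})
    return kept


# Made-up distances: 0.0 -> 1.0, 1.0 -> 0.5, 3.0 -> 0.25.
rows = [
    {"id": "a", "distance": 0.0},
    {"id": "b", "distance": 1.0},
    {"id": "c", "distance": 3.0},
]
kept = filter_by_relevance(rows, threshold=0.5)
print([(row["id"], row["relevance"]) for row in kept])
```

A distance of 0 maps to a relevance of 1, and the relevance approaches 0 as the distance grows, so the values always fall within the 0 to 1 range stated above.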
Semantic Search
Users can query a knowledge base using semantic search by providing the search phrase (called content) to be searched for.
Here is the output:
When querying a knowledge base, the default values include the following:
- relevance: If not provided, its default value is equal to or greater than 0, ensuring there is no filtering of rows based on their relevance.
- LIMIT: If not provided, its default value is 10, returning a maximum of 10 rows.
Note that when specifying both relevance and LIMIT as follows:
The query extracts 20 rows (as defined in the LIMIT clause) that match the defined content. Next, this set of rows is filtered to match the defined relevance.
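The order of operations matters: the limit is applied first and the relevance filter second, so a query can return fewer rows than the limit. The sketch below models that order on made-up rows. It assumes candidates are ranked by ascending distance and that relevance is compared with a minimum threshold. `limit_then_filter` is a hypothetical function, not MindsDB's query engine.

```python
def limit_then_filter(rows, limit=10, min_relevance=0.0):
    """Take the `limit` closest rows first, then drop those below min_relevance."""
    closest = sorted(rows, key=lambda row: row["distance"])[:limit]
    return [row for row in closest if row["relevance"] >= min_relevance]


# Made-up candidates; relevance here stands in for a reranker's score.
rows = [
    {"id": "a", "distance": 0.1, "relevance": 0.9},
    {"id": "b", "distance": 0.2, "relevance": 0.4},
    {"id": "c", "distance": 0.3, "relevance": 0.7},
    {"id": "d", "distance": 0.4, "relevance": 0.95},
]
result = limit_then_filter(rows, limit=3, min_relevance=0.5)
print([row["id"] for row in result])
```

In this example row `d` is never considered because it falls outside the first three candidates, and row `b` is dropped by the relevance filter, so only two rows come back. Raising the limit widens the candidate set that the relevance filter works on.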
Users can limit the relevance in order to get only the most relevant results.
Here is the output:
By providing the relevance filter, the output is limited to data whose relevance score satisfies the provided condition. The available values of relevance are between 0 and 1, and its default value covers all available relevance values, ensuring no filtering based on the relevance score.
Users can limit the number of rows returned.
Here is the output:
Metadata Filtering
Besides semantic search features, knowledge bases enable users to filter the result set by the defined metadata.
Here is the output:
Note that when searching by metadata alone, the relevance column values are not calculated.
Users can both filter by metadata and search by content.
Here is the output:
drop() Function
Here is the syntax for deleting a knowledge base:
Upon execution, it removes the knowledge base with its content.
See more examples of knowledge bases via SQL here.