Knowledge Base
A knowledge base is an advanced system designed to go beyond traditional data storage. Its components include embedding and reranking models and a vector store.
It intelligently organizes information in semantically meaningful ways, allowing for context-aware retrieval based on meaning rather than just keywords. With the ability to perform semantic reasoning across multiple pieces of information, it delivers deeper insights and more accurate responses.
Create a Knowledge Base
Here is the syntax for creating a knowledge base:
MindsDB stores objects, such as models or knowledge bases, inside projects. Learn more about projects here.
Upon execution, it registers my_kb
and associates the specified models and storage.
The following is an explanation of the syntax and parameters:
-
embedding_model
stores specifications of the embedding model to be used.provider
defines the model provider. Currently, the supported providers include OpenAI and OpenAI via Azure.model_name
defines the embedding model name as specified by the provider. Users can choose one of the OpenAI embedding models.base_url
is the root URL used to send API requests. It is required when usingazure_openai
as the provider. Its default value ishttps://api.openai.com/v1/
.api_key
stores your OpenAI API key. Alternatively, you can use theOPENAI_API_KEY
environment variable to store your OpenAI API key.api_version
is required when usingazure_openai
as the provider.
-
reranking_model
is an optional parameter that stores specification of the reranking model to be used.provider
defines the model provider. Currently, the supported providers include OpenAI and OpenAI via Azure.model_name
defines the reranking model name as specified by the provider. Users can choose one of the OpenAI chat models.base_url
is the root URL used to send API requests. It is required when usingazure_openai
as the provider. Its default value ishttps://api.openai.com/v1/
.api_key
stores your OpenAI API key. Alternatively, you can use theOPENAI_API_KEY
environment variable to store your OpenAI API key.api_version
is required when usingazure_openai
as the provider.method
defines the method used to calculate the relevance of the output rows. The available options includemulti-class
andbinary
. It defaults tomulti-class
.
Reranking Method
The multi-class
reranking method classifies each document chunk (that meets any specified metadata filtering conditions) into one of four relevance classes:
- Not relevant with class weight of 0.25.
- Slightly relevant with class weight of 0.5.
- Moderately relevant with class weight of 0.75.
- Highly relevant with class weight of 1.
The overall relevance_score
of a document is calculated as the sum of each chunk’s class weight multiplied by its class probability (from model logprob output).
The binary
reranking method simplifies classification by determining whether a document is relevant or not, without intermediate relevance levels. With this method, the overall relevance_score
of a document is calculated based on the model log probability.
-
storage
is an optional parameter. It assigns the name of the vector store created within MindsDB with theCREATE DATABASE
command, which is either PGVector or ChromaDB, followed by a table name where the knowledge base stores its content; note that this table is created within the vector store automatically. If not provided, the system creates the default ChromaDB vector database called<kb_name>_chromadb
with the default tables calleddefault_collection
that stores the embedded data. -
metadata_columns
provides the list of column names that will be used as the metadata of the knowledge base. Users can use these column names to insert them into the knowledge base and filter the search results by metadata. If not set, the metadata is not used. -
content_columns
provides the list of column names that will be embedded as the content of the knowledge base. Users can use these column names to insert them into the knowledge base and filter the search results by content. If not set, thecontent
column is used by default. -
id_column
is the column name that will be used as the unique identifier by the knowledge base. If not set, theid
column is used by default.
When the id_column
parameter is provided, the knowledge base uses this ID to identify rows from the source data.
When the id_column
is not provided, the knowledge base uses an auto-generated sequence of numbers to identify rows from the source data. Note that these auto-generated IDs are duplicated in the following cases:
- If the source data row is chunked into multiple chunks, then these rows in the knowledge base have the same auto-generated ID as they come from the same source data row.
- If the data from another data source is inserted, then the auto-generated sequence starts from zero again.
Get details about the knowledge base using the DESCRIBE
command.
Here is the sample output:
Insert into a Knowledge Base
Here is the syntax for inserting into a knowledge base:
Upon execution, it inserts data into a knowledge base, using the embedding model to embed it into vectors before inserting into an underlying vector database.
The following is an explanation of the syntax and parameters:
my_kb
is a unique identifier of a knowledge base.<id_col>
is a column that contains unique identifiers of data. The knowledge base expects theid
column name, unless specified differently in theid_column
parameter when creating the knowledge base.<text_col>
is a column that contains the text content. The knowledge base expects thecontent
column name, unless specified differently in thecontent_columns
parameter when creating the knowledge base.
Note that when inserting the same data into the knowledge base again, then no new rows are inserted, as the content already exists in the knowledge base.
In order to update existing data in the knowledge base, insert data with the column ID that you want to update and the updated content.
Search a Knowledge Base
Semantic Search at the Document Level
KBs provide semantic search returning document-level relevance.
Upon execution, it performs the semantic search workflow (including embedding the query, searching vectors, and optionally reranking) and returns document identifiers and relevance scores.
The following is an explanation of the syntax and parameters:
SELECT id, relevance
specifies that the unique document identifier and its calculated relevance score should be returned.<kb_name>
is a unique identifier of a knowledge base.WHERE content = '<query_text>'
triggers semantic search based on the meaning of<query_text>
.AND relevance_threshold = 0.6
is an optional filtering condition. It limits the output based on the relevance score and can have values between 0 and 1. It defines the minimum relevance value for the data row to appear in the output. Its default value is 0.LIMIT <N>
restricts results to the top N most relevant documents. When using the semantic search (as opposed to metadata-only search), the default limit is 10 and the maximum limit is 100.
Here are the conditions supported by the available vector stores:
Handler | IN, NOT IN | LIKE, NOT LIKE | < and > | != |
---|---|---|---|---|
PgVector | + | + | + | + |
ChromaDB | + | - | - | + |
Semantic Search at the Chunk Level
KBs provide semantic search returning chunk-level details.
Upon execution, it performs the semantic search workflow (including embedding the query, searching vectors, and optionally reranking) and returns details pertaining to the specific text chunks that matched the query most closely.
The following is an explanation of the syntax and parameters:
SELECT id, chunk_id, chunk_content, metadata, distance, relevance
specifies that the unique document identifier, chunk identifier, chunk content, metadata (if defined), distance, and relevance should be returned.<kb_name>
is a unique identifier of a knowledge base.WHERE content = '<query_text>'
triggers semantic search based on the meaning of<query_text>
.AND relevance_threshold = 0.6
is an optional filtering condition. It limits the output based on the relevance score and can have values between 0 and 1. It defines the minimum relevance value for the data row to appear in the output. Its default value is 0.LIMIT <N>
restricts results to the top N most relevant documents. When using the semantic search (as opposed to metadata-only search), the default limit is 10 and the maximum limit is 100.
Metadata Filtering with Semantic Search
KBs combine semantic search with standard SQL filtering on metadata.
Upon execution, it performs the conjunctive filtering, that is, results must satisfy both the metadata condition(s) and the semantic search condition.
The following is an explanation of the syntax and parameters:
SELECT id, chunk_content, relevance
specifies that the unique document identifier, chunk content, and the relevance score should be returned.<kb_name>
is a unique identifier of a knowledge base.WHERE <metadata_colums> = '<metadata>'
is the metadata filtering condition.WHERE content = '<query_text>'
is the semantic search condition.AND relevance_threshold = 0.6
is an optional filtering condition. It limits the output based on the relevance score and can have values between 0 and 1. It defines the minimum relevance value for the data row to appear in the output. Its default value is 0.LIMIT <N>
restricts results to the top N most relevant documents. When using the semantic search (as opposed to metadata-only search), the default limit is 10 and the maximum limit is 100.
JOIN
Support
KBs can be used in the standard SQL JOIN statements.
The following is an explanation of the syntax and parameters:
SELECT t.id, t.title, kb.chunk_content, kb.relevance
specifies the columns selected from a table (t
) and a knowledge base (kb
).WHERE <table_filter_condition>
sets conditions on a table.AND kb.content = '<query_text>'
sets a semantic search condition on a knowledge base.
Delete from a Knowledge Base
Here is the syntax for deleting from a knowledge base:
Upon execution, it identifies matching records based on the user-defined condition and removes all associated data (metadata, content, chunks, embeddings) for matching records from the KB’s storage.
The following is an explanation of the syntax and parameters:
<kb_name>
is a unique identifier of a knowledge base.<condition>
is a standard SQL WHERE clause condition operating on metadata columns.
Drop a Knowledge Base
Here is the syntax for deleting a knowledge base:
Upon execution, it removes the knowledge base with its content.
Was this page helpful?