- **Create a knowledge base**: use the `create()` function to create a knowledge base, specifying all its components.
- **Insert data into the knowledge base**: use the `insert_query()` function to ingest data into the knowledge base from a query.
- **Run semantic search on the knowledge base**.
- **Get the most relevant search results**: use the `relevance` parameter to receive only the best-matching data for your use case.
- **Filter results by metadata**.
## `create()` Function

This function creates the knowledge base `my_kb` and associates the specified models and storage. `my_kb` is a unique identifier of the knowledge base within MindsDB.

The models used by the knowledge base are defined in the `embedding_model` and `reranking_model` parameters.
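As a sketch of how this might look with the MindsDB Python SDK (the connection URL and the `knowledge_bases.create` accessor are assumptions based on this page, not confirmed details):

```python
# Hedged sketch: creating a knowledge base with the create() function.
# Requires a running MindsDB server; the URL and accessor names below
# are assumptions, not confirmed by this page.
def create_kb(url: str = "http://127.0.0.1:47334"):
    import mindsdb_sdk  # deferred import: needs the mindsdb-sdk package
    server = mindsdb_sdk.connect(url)
    # The embedding_model, reranking_model, and storage specifications
    # described below are passed alongside the unique name.
    return server.knowledge_bases.create("my_kb")
```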
### `provider = 'openai'`

When choosing `openai` as the model provider, users should define the following model parameters:

- `model_name` stores the name of the OpenAI model to be used.
- `api_key` stores the OpenAI API key.

### `provider = 'openai_azure'`

When choosing `openai_azure` as the model provider, users should define the following model parameters:

- `model_name` stores the name of the OpenAI model to be used.
- `api_key` stores the OpenAI API key.
- `base_url` stores the base URL of the Azure instance.
- `api_version` stores the version of the Azure instance.

### `provider = 'bedrock'`

When choosing `bedrock` as the model provider, users should define the following model parameters:

- `model_name` stores the name of the model available via Amazon Bedrock.
- `aws_access_key_id` stores a unique identifier associated with your AWS account, used to identify the user or application making requests to AWS.
- `aws_region_name` stores the name of the AWS region you want to send your requests to (e.g., `"us-west-2"`).
- `aws_secret_access_key` stores the secret key associated with your AWS access key ID. It is used to sign your requests securely.
- `aws_session_token` stores a temporary token used for short-term security credentials when using AWS Identity and Access Management (IAM) roles or temporary credentials.

### `provider = 'snowflake'`

When choosing `snowflake` as the model provider, users should choose one of the available models from Snowflake Cortex AI and define the following model parameters:

- `model_name` stores the name of the model available via Snowflake Cortex AI.
- `api_key` stores the Snowflake Cortex AI API key.
- `snowflake_account_id` stores the Snowflake account ID.

## `embedding_model`
Use the `embedding_model` parameter to define the specification.

If you define the `default_embedding_model` in the configuration file, you do not need to provide the `embedding_model` parameter when creating a knowledge base. If you provide both, the values from the `embedding_model` parameter are used.

- `provider`: a required parameter. It defines the model provider as listed in supported LLMs.
- `model_name`: a required parameter. It defines the embedding model name as specified by the provider.
- `api_key`: the API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this `api_key` parameter, or in the `OPENAI_API_KEY` environment variable for `"provider": "openai"` and the `AZURE_OPENAI_API_KEY` environment variable for `"provider": "azure_openai"`.
- `base_url`: an optional parameter, which defaults to `https://api.openai.com/v1/`. It is required when using the `azure_openai` provider. It is the root URL used to send API requests.
- `api_version`: an optional parameter. It is required when using the `azure_openai` provider. It defines the API version.
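Under the parameter names above, an `embedding_model` specification can be sketched as a Python dictionary. The model names and key values below are placeholders, not recommendations:

```python
# Hedged sketch: example embedding_model specifications as Python dicts,
# mirroring the documented parameter names. Model names and keys are
# placeholders.
openai_embedding = {
    "provider": "openai",
    "model_name": "text-embedding-3-large",  # placeholder model name
    "api_key": "sk-...",                     # or set OPENAI_API_KEY
}

azure_embedding = {
    "provider": "azure_openai",
    "model_name": "text-embedding-3-large",  # placeholder deployment name
    "api_key": "...",                        # or set AZURE_OPENAI_API_KEY
    "base_url": "https://my-instance.openai.azure.com/",  # required for azure_openai
    "api_version": "2024-02-01",             # required for azure_openai
}
```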
## `reranking_model`

Use the `reranking_model` parameter to define the specification. Reranking can be disabled by setting this parameter to `false`.

If you define the `default_reranking_model` in the configuration file, you do not need to provide the `reranking_model` parameter when creating a knowledge base. If you provide both, the values from the `reranking_model` parameter are used.

- `provider`: a required parameter. It defines the model provider. Currently, the supported providers include OpenAI (`openai`) and OpenAI via Azure (`azure_openai`).
- `model_name`: a required parameter. It defines the reranking model name as specified by the provider. Users can choose one of the OpenAI chat models.
- `api_key`: the API key is required to access the reranking model assigned to a knowledge base. Users can provide it either in this `api_key` parameter, or in the `OPENAI_API_KEY` environment variable for `"provider": "openai"` and the `AZURE_OPENAI_API_KEY` environment variable for `"provider": "azure_openai"`.
- `base_url`: an optional parameter, which defaults to `https://api.openai.com/v1/`. It is required when using the `azure_openai` provider. It is the root URL used to send API requests.
- `api_version`: an optional parameter. It is required when using the `azure_openai` provider. It defines the API version.
- `method`: an optional parameter. It defines the method used to calculate the relevance of the output rows. The available options include `multi-class` and `binary`. It defaults to `multi-class`.

The `multi-class` reranking method classifies each document chunk (that meets any specified metadata filtering conditions) into one of four relevance classes. The overall `relevance_score` of a document is calculated as the sum of each chunk's class weight multiplied by its class probability (from model logprob output).

The `binary` reranking method simplifies classification by determining whether a document is relevant or not, without intermediate relevance levels. With this method, the overall `relevance_score` of a document is calculated based on the model log probability.

## `storage`
Use the `storage` parameter to define the vector database where the knowledge base stores embedded data. If not provided, the default ChromaDB is created when creating a knowledge base. The available options include either PGVector or ChromaDB.

If the `storage` parameter is not provided, the system creates the default ChromaDB vector database called `<kb_name>_chromadb` with the default table called `default_collection` that stores the embedded data. This default ChromaDB vector database is stored in MindsDB's storage.

In order to use your own storage vector database, it is required to connect it to MindsDB beforehand.

Here is an example for PGVector.
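A hedged sketch of connecting PGVector with the Python SDK follows; the connection details are placeholders, and the exact `create_database` arguments may vary by SDK version:

```python
# Hedged sketch: connecting a PGVector database to MindsDB so it can serve
# as knowledge-base storage. Requires a running MindsDB server; all
# connection values below are placeholders, not real credentials.
def connect_pgvector(url: str = "http://127.0.0.1:47334"):
    import mindsdb_sdk  # deferred import: needs the mindsdb-sdk package
    server = mindsdb_sdk.connect(url)
    return server.create_database(
        name="my_pgvector",
        engine="pgvector",
        connection_args={
            "host": "127.0.0.1",    # placeholder host
            "port": 5432,
            "database": "postgres",
            "user": "user",
            "password": "password",
        },
    )
```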
Note that the `storage_table` does not need to exist beforehand, as it is created when creating a knowledge base.

## `metadata_columns`
Columns listed in `metadata_columns` cannot be used in `content_columns`, and vice versa. All other columns of the source data (except `id_column` and `content_columns`) are considered metadata columns.

Here is an example of usage. A user wants to store the following data in a knowledge base. The `product` column can be used as metadata to enable metadata filtering.
## `content_columns`

Columns listed in `content_columns` cannot be used in `metadata_columns`, and vice versa. If not specified, a column named `content` is expected by default when inserting data into the knowledge base.

Here is an example of usage. A user wants to store the following data in a knowledge base. The `notes` column can be used as content.
## `id_column`

The `id_column` uniquely identifies each source data row within the knowledge base. For example, the `order_id` column can be used as the ID.

By defining the `id_column` parameter, the values from the provided source data column are used to identify source data rows within the knowledge base. If the source data does not provide a unique ID column, one can be generated in the query using functions such as `HASH()` or `ROW_NUMBER()`.

If the `id_column` parameter is not defined, its default values are built from the hash of the content columns and follow the format: `<first 16 char of md5 hash of row content>`.
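The default ID format can be illustrated in plain Python. Note that the exact input MindsDB hashes (for example, how multiple content columns are joined) is not specified here, so this is a sketch of the format only:

```python
# Illustration of the default ID format: the first 16 characters of the
# MD5 hash of the row's content. How MindsDB joins multiple content
# columns before hashing is an assumption left out of this sketch.
import hashlib

def default_row_id(row_content: str) -> str:
    return hashlib.md5(row_content.encode("utf-8")).hexdigest()[:16]

row_id = default_row_id("Great laptop stand, very sturdy.")
```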
## `list()` and `get()` Functions

Users can retrieve an existing knowledge base with the `get()` function, and list all available knowledge bases with the `list()` function.
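A hedged sketch of both functions with the Python SDK (the `knowledge_bases` accessor and connection URL are assumptions based on this page):

```python
# Hedged sketch: listing and fetching knowledge bases with the list() and
# get() functions. Requires a running MindsDB server; accessor names are
# assumptions, not confirmed by this page.
def show_knowledge_bases(url: str = "http://127.0.0.1:47334"):
    import mindsdb_sdk  # deferred import: needs the mindsdb-sdk package
    server = mindsdb_sdk.connect(url)
    for kb in server.knowledge_bases.list():    # list all knowledge bases
        print(kb.name)
    return server.knowledge_bases.get("my_kb")  # fetch one by name
```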
## `insert()` Function

Use the `insert()` function to insert data into the knowledge base. When inserting web pages, the following parameters are available:

- `urls`: base URLs to crawl.
- `crawl_depth`: depth for recursive crawling. Default is 1.
- `filters`: regex patterns to include.
- `limit`: maximum number of pages.

Each insert query is recorded in the `information_schema.queries` table with the timestamp when it was run.

### Without `id_column` defined
When users do not define the `id_column` during the creation of a knowledge base, MindsDB generates the ID for each row using a hash of the content columns, as explained here.

Example:

If two rows have exactly the same content in the content columns, their hash (and thus their generated ID) will be the same. Note that duplicate rows are skipped and not inserted. Since both rows in the below table have the same content, only one row will be inserted.
name | age |
---|---|
Alice | 25 |
Alice | 25 |
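The duplicate-skipping behavior can be sketched in plain Python, reusing the documented hash-based ID format:

```python
# Sketch of duplicate skipping: rows whose content columns produce the
# same hash-based ID are inserted only once. The join of content columns
# into one string is an assumption for illustration.
import hashlib

rows = [
    {"name": "Alice", "age": 25},
    {"name": "Alice", "age": 25},  # identical content -> identical ID
]

inserted = {}
for row in rows:
    content = f"{row['name']} {row['age']}"
    row_id = hashlib.md5(content.encode("utf-8")).hexdigest()[:16]
    if row_id not in inserted:  # duplicate IDs are skipped
        inserted[row_id] = row
```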
### With `id_column` defined

When users define the `id_column` during the creation of a knowledge base, the knowledge base uses that column's values as the row ID.

Example:

If the `id_column` has duplicate values, the knowledge base skips the duplicate row(s) during the insert. The second row in the below table has the same `id` as the first row, so only one of these rows is inserted.
id | name | age |
---|---|---|
1 | Alice | 25 |
1 | Bob | 30 |
Ensure that the `id_column` uniquely identifies each row to avoid unintentional data loss due to duplicate ID skipping. Inserting a row with an existing ID updates its content, for example, changing `Laptop Stand` to `Aluminum Laptop Stand`.
The following parameters control partitioning of the data insertion process:

- `batch_size` defines the number of rows fetched per iteration to optimize data extraction from the source. It defaults to 1000.
- `threads` defines the threads used for running partitions. Note that if the ML task queue is enabled, threads are used automatically. The available values for `threads` are a defined number of threads (e.g., `threads = 10`), enabled threads (`threads = true`), or disabled threads (`threads = false`).
- `track_column` defines the column used for sorting data before partitioning.
- `error` defines the error processing options. The available values include `raise`, used to raise errors as they come, and `skip`, used to skip errors and continue. It defaults to `raise` if not provided.
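These parameters can be collected as a Python dictionary. How the dictionary reaches the insert call depends on the SDK version, and the `track_column` value below is a placeholder:

```python
# Hedged sketch: insertion-control parameters using the documented names
# and defaults. The track_column value is a placeholder.
insert_params = {
    "batch_size": 1000,    # rows fetched per iteration (default 1000)
    "threads": 10,         # a thread count, or True/False to enable/disable
    "track_column": "id",  # placeholder: column used to sort before partitioning
    "error": "skip",       # "raise" (default) or "skip"
}
```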
After executing the `INSERT INTO` statement with the above parameters, users can view the data insertion progress by querying the `information_schema.queries` table.

Users can cancel the data insertion process using the process ID from the `information_schema.queries` table: get the process ID and pass it as an argument to the `query_cancel()` function. Note that canceling the query will not remove the already inserted data.

Users can resume the data insertion process using the process ID from the `information_schema.queries` table: get the process ID and pass it as an argument to the `query_resume()` function. Note that resuming the query will not remove the already inserted data and will start appending the remaining data.
When chunking is performed, chunk IDs follow the format `<id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>`.

The `chunk_size` parameter defines the size of each chunk as a number of characters, and the `chunk_overlap` parameter defines the number of characters that should overlap between subsequent chunks.
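The interaction of `chunk_size` and `chunk_overlap` can be illustrated with a minimal character splitter; MindsDB's actual splitter may differ in detail:

```python
# Illustrative sketch of character-based chunking with chunk_size and
# chunk_overlap. Assumes chunk_overlap < chunk_size.
def chunk_text(text: str, chunk_size: int, chunk_overlap: int):
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
# chunks -> ["abcd", "cdef", "efgh", "ghij", "ij"]
```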
When the `type` of chunking is set to `json_chunking`, users can configure it by setting the following parameter values in the `json_chunking_config` parameter:

- `flatten_nested` is of the `bool` data type with the default value of `True`.
- `include_metadata` is of the `bool` data type with the default value of `True`.
- `chunk_by_object` is of the `bool` data type with the default value of `True`. It defines whether to chunk by object (`True`) or create a single document (`False`).
- `exclude_fields` is of the `List[str]` data type with the default value of an empty list.
- `include_fields` is of the `List[str]` data type with the default value of an empty list.
- `metadata_fields` is of the `List[str]` data type with the default value of an empty list.
- `extract_all_primitives` is of the `bool` data type with the default value of `False`.
- `nested_delimiter` is of the `str` data type with the default value of `"."`.
- `content_column` is of the `str` data type with the default value of `"content"`.
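The effect of `flatten_nested` with the default `nested_delimiter` of `"."` can be sketched in plain Python; MindsDB's implementation may differ in detail:

```python
# Sketch: flattening a nested JSON object with a "." delimiter, as implied
# by the flatten_nested and nested_delimiter options above.
def flatten(obj, delimiter=".", prefix=""):
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, delimiter, path))  # recurse into nested objects
        else:
            flat[path] = value
    return flat

flat = flatten({"order": {"id": 1, "item": {"name": "stand"}}})
# flat -> {"order.id": 1, "order.item.name": "stand"}
```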
## `find()` Function

The output of the `find()` function contains the following columns:

- `id` stores values from the column defined in the `id_column` parameter when creating the knowledge base. These are the source data IDs.
- `chunk_id`: knowledge bases chunk the inserted data in order to fit the defined chunk size. If chunking is performed, the following chunk ID format is used: `<id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>`.
- `chunk_content` stores values from the column(s) defined in the `content_columns` parameter when creating the knowledge base.
- `metadata` stores the general metadata and the metadata defined in the `metadata_columns` parameter when creating the knowledge base.
- `distance` stores the calculated distance between the chunk's content and the search phrase.
- `relevance` stores the calculated relevance of the chunk as compared to the search phrase. Its values are between 0 and 1.
The calculation of `relevance` differs as follows:

- `relevance` is equal to or greater than 0, unless defined otherwise in the `WHERE` clause.
- If `relevance` is not defined in the query, then no relevance filtering is applied and the output includes all rows matched based on the similarity and metadata search.
- If `relevance` is defined in the query, then the relevance is calculated based on the `distance` column (`1/(1 + distance)`) and this calculated value is compared with the defined `relevance` value to filter the output.

The query defines the search phrase (`content`) to be searched for.
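The distance-based relevance fallback above can be checked with a few lines of Python:

```python
# Sketch of the fallback relevance calculation: when a relevance threshold
# is defined in the query, relevance is derived from the vector distance
# as 1 / (1 + distance) and compared against the threshold.
def relevance_from_distance(distance: float) -> float:
    return 1.0 / (1.0 + distance)

rows = [{"chunk": "a", "distance": 0.0}, {"chunk": "b", "distance": 3.0}]
threshold = 0.5
kept = [r for r in rows if relevance_from_distance(r["distance"]) >= threshold]
# distance 0.0 -> relevance 1.0 (kept); distance 3.0 -> relevance 0.25 (filtered out)
```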
### `relevance` and `LIMIT`

Knowledge bases apply `relevance` and `LIMIT` as follows: first, the query selects the rows (limited by the `LIMIT` clause) that match the defined `content`. Next, this set of rows is filtered to match the defined `relevance`. Use `relevance` in order to get only the most relevant results.

When using the `relevance` filter, the output is limited to only data with a relevance score of at least the provided value. The available values of `relevance` are between 0 and 1, and its default value covers all available relevance values, ensuring no filtering based on the relevance score.

Users can limit the number of rows returned.

If reranking is disabled and no `relevance` filter is provided, the `relevance` column values are not calculated.

Users can do both: filter by metadata and search by content.
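A hedged sketch of a search via the SDK follows. The `find()` call requires a running MindsDB server, and the accessor names and method signature are assumptions based on this page:

```python
# Hedged sketch: semantic search with the find() function. Relevance and
# metadata filtering would be applied as described above; the exact
# find() signature is an assumption, not confirmed by this page.
def search_kb(url: str = "http://127.0.0.1:47334"):
    import mindsdb_sdk  # deferred import: needs the mindsdb-sdk package
    server = mindsdb_sdk.connect(url)
    my_kb = server.knowledge_bases.get("my_kb")
    return my_kb.find("sturdy laptop stand").fetch()  # search by content
```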
## `drop()` Function