How to Insert Data into Knowledge Bases
Knowledge Bases organize data from various data sources, including databases, files, documents, and webpages. Users can insert data in partitions and specify the chunking strategy.
INSERT INTO
Syntax
Here is the syntax for inserting data into a knowledge base:
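A minimal sketch of the statement, assuming a knowledge base named `my_kb` and a connected data source `my_datasource` with a table `my_table` (all placeholder names):

```sql
-- Select rows from a connected data source and insert them into the knowledge base.
INSERT INTO my_kb
    SELECT id, content
    FROM my_datasource.my_table;
```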
Upon execution, it inserts data into a knowledge base, using the embedding model to embed it into vectors before storing them in the underlying vector database.
The status of the `INSERT INTO` statement is logged in the `information_schema.queries` table with the timestamp when it was run, and can be queried as follows:
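For example, a simple query over that table shows all logged queries along with their status and timestamps:

```sql
-- List logged queries, including INSERT INTO statements, with their status.
SELECT *
FROM information_schema.queries;
```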
To speed up the initial loading of a knowledge base with data, run the `INSERT INTO` command with the `kb_no_upsert` flag. It is recommended to use this flag only when the knowledge base is empty, that is, only for the initial data insertion.
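A sketch of such an initial bulk load, assuming the flag is passed in a `USING` clause (placeholder names as before):

```sql
-- Initial bulk load into an empty knowledge base, skipping upsert checks.
INSERT INTO my_kb
    SELECT id, content
    FROM my_datasource.my_table
USING
    kb_no_upsert = true;
```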
Handling duplicate data while inserting into the knowledge base
Knowledge bases uniquely identify data rows using an ID column, which prevents inserting duplicate data, as described below.
- Case 1: Inserting data into the knowledge base without the `id_column` defined.

  When users do not define the `id_column` during the creation of a knowledge base, MindsDB generates the ID for each row using a hash of the content columns, as explained here.

  Example:

  If two rows have exactly the same content in the content columns, their hash (and thus their generated ID) will be the same. Note that duplicate rows are skipped and not inserted.

  Since both rows in the table below have the same content, only one row will be inserted.

  | name  | age |
  | ----- | --- |
  | Alice | 25  |
  | Alice | 25  |
- Case 2: Inserting data into the knowledge base with the `id_column` defined.

  When users define the `id_column` during the creation of a knowledge base, the knowledge base uses that column’s values as the row ID.

  Example:

  If the `id_column` has duplicate values, the knowledge base skips the duplicate row(s) during the insert. The second row in the table below has the same `id` as the first row, so only one of these rows is inserted.

  | id | name  | age |
  | -- | ----- | --- |
  | 1  | Alice | 25  |
  | 1  | Bob   | 30  |
Best practice
Ensure the `id_column` uniquely identifies each row to avoid unintentional data loss due to duplicate ID skipping.
Update Existing Data
In order to update existing data in the knowledge base, insert data with the ID of the row you want to update and the updated content.
Here is an example of usage. A knowledge base stores the following data.
A user updated `Laptop Stand` to `Aluminum Laptop Stand`.
Go to the Complete Example section below to find out how to access this sample data.
Here is how to propagate this change into the knowledge base.
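A sketch of such an update, assuming the product row has `id = 2` and using placeholder table and column names:

```sql
-- Re-insert the row with its existing ID and the updated content;
-- the knowledge base updates the stored data for that ID.
INSERT INTO my_kb
    SELECT product_id AS id, product_name AS content
    FROM my_datasource.products
    WHERE product_id = 2;
```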
The knowledge base matches the ID value to the existing one and updates the data if required.
Insert Data using Partitions
In order to optimize the performance of data insertion into the knowledge base, users can set up partitions and threads to insert batches of data in parallel. This also enables tracking the progress of the data insertion process, including cancelling and resuming it if required.
Here is an example.
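A sketch of a partitioned insert, assuming the parameters below are passed in a `USING` clause (placeholder names throughout):

```sql
-- Insert data in parallel batches of 1000 rows using 10 threads.
INSERT INTO my_kb
    SELECT id, content
    FROM my_datasource.my_table
USING
    batch_size = 1000,
    threads = 10,
    track_column = id,
    error = 'skip';
```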
The parameters include the following:

- `batch_size` defines the number of rows fetched per iteration to optimize data extraction from the source. It defaults to 1000.
- `threads` defines the threads used for running partitions. Note that if the ML task queue is enabled, threads are used automatically. The available values for `threads` are:
  - a number of threads to be used, for example, `threads = 10`,
  - a boolean value that defines whether to enable threads, setting `threads = true`, or disable threads, setting `threads = false`.
- `track_column` defines the column used for sorting data before partitioning.
- `error` defines the error processing options. The available values include `raise`, used to raise errors as they come, and `skip`, used to skip errors and continue. It defaults to `raise` if not provided.
After executing the `INSERT INTO` statement with the above parameters, users can view the data insertion progress by querying the `information_schema.queries` table.
Users can cancel the data insertion process using the process ID from the `information_schema.queries` table.
Note that canceling the query will not remove the already inserted data.
Users can resume the data insertion process using the process ID from the `information_schema.queries` table.
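A sketch of cancelling and resuming an insert by its process ID; the `query_cancel` and `query_resume` helpers and the ID value `1` are assumptions to verify against your `information_schema.queries` output:

```sql
-- Cancel the running insert with process ID 1 (already inserted data is kept).
SELECT query_cancel(1);

-- Resume the insert with process ID 1 at a later point.
SELECT query_resume(1);
```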
Chunking Data
Upon inserting data into the knowledge base, data chunking is performed in order to optimize the storage and search of data.
Each chunk is identified by its chunk ID of the following format: `<id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>`.
Text
Users can optionally define the chunking parameters when creating a knowledge base. The `chunk_size` parameter defines the size of the chunk as the number of characters, and the `chunk_overlap` parameter defines the number of characters that should overlap between subsequent chunks.
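A sketch of setting these parameters at creation time; the nesting of `chunk_size` and `chunk_overlap` under a `preprocessing` parameter, as well as the embedding model values, are assumptions:

```sql
CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name": "text-embedding-3-large"
    },
    preprocessing = {
        "text_chunking_config": {
            "chunk_size": 2000,
            "chunk_overlap": 200
        }
    };
```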
JSON
Users can also define chunking parameters specifically for JSON data. When the `type` of chunking is set to `json_chunking`, users can configure it by setting the following parameter values in the `json_chunking_config` parameter (see the example after the parameter list):
- `flatten_nested`
  It is of the `bool` data type with the default value of `True`.
  It defines whether to flatten nested JSON structures.
- `include_metadata`
  It is of the `bool` data type with the default value of `True`.
  It defines whether to include original metadata in chunks.
- `chunk_by_object`
  It is of the `bool` data type with the default value of `True`.
  It defines whether to chunk by top-level objects (`True`) or create a single document (`False`).
- `exclude_fields`
  It is of the `List[str]` data type with the default value of an empty list.
  It defines the list of fields to exclude from chunking.
- `include_fields`
  It is of the `List[str]` data type with the default value of an empty list.
  It defines the list of fields to include in chunking (if empty, all fields except excluded ones are included).
- `metadata_fields`
  It is of the `List[str]` data type with the default value of an empty list.
  It defines the list of fields to extract into metadata for filtering (can include nested fields using dot notation). If empty, all primitive fields will be extracted (top-level fields if available, otherwise all primitive fields in the flattened structure).
- `extract_all_primitives`
  It is of the `bool` data type with the default value of `False`.
  It defines whether to extract all primitive values (strings, numbers, booleans) into metadata.
- `nested_delimiter`
  It is of the `str` data type with the default value of `"."`.
  It defines the delimiter for flattened nested field names.
- `content_column`
  It is of the `str` data type with the default value of `"content"`.
  It defines the name of the content column for chunk ID generation.
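As referenced above, here is a sketch of a JSON chunking configuration; the placement of `type` and `json_chunking_config` under a `preprocessing` parameter, as well as the embedding model values and field names, are assumptions:

```sql
CREATE KNOWLEDGE_BASE my_json_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name": "text-embedding-3-large"
    },
    preprocessing = {
        "type": "json_chunking",
        "json_chunking_config": {
            "chunk_by_object": true,
            "exclude_fields": ["internal_notes"],
            "metadata_fields": ["category", "specs.weight"],
            "nested_delimiter": "."
        }
    };
```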
Underlying Vector Store
Each knowledge base has its underlying vector store that stores data inserted into the knowledge base in the form of embeddings.
Users can query the underlying vector store as follows.
- KB with the default ChromaDB vector store:
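A sketch of such a query, assuming the default ChromaDB store is exposed under the name `my_kb_chromadb` with a `default_collection` table (both names are assumptions):

```sql
-- Inspect the embedded data stored for the knowledge base.
SELECT *
FROM my_kb_chromadb.default_collection;
```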
Example
Here, data is inserted into the sample knowledge base created in the previous Example section.
When inserting into a knowledge base where the `content_columns` parameter was not specified, the column storing content must be aliased `AS content`, as below.
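A sketch of such an insert, where `sample_data.orders` and its columns stand in for the sample data referenced above:

```sql
-- content_columns was not specified at creation time,
-- so the column holding the text is aliased AS content.
INSERT INTO my_kb
    SELECT order_id AS id, notes AS content
    FROM sample_data.orders;
```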
DELETE FROM
Syntax
Here is the syntax for deleting from a knowledge base:
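A minimal sketch, assuming rows are deleted by their ID values:

```sql
-- Remove the row with ID 2 from the knowledge base.
DELETE FROM my_kb
WHERE id = 2;
```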
CREATE INDEX ON KNOWLEDGE_BASE
Syntax
Users can create an index on the knowledge base to speed up the search operations.
Note that this feature works only when PGVector is used as the storage of the knowledge base, as ChromaDB provides the index features by default.
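A minimal sketch of the statement for a knowledge base named `my_kb`:

```sql
-- Create an index on the knowledge base's underlying PGVector store.
CREATE INDEX ON KNOWLEDGE_BASE my_kb;
```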
Upon executing this statement, an index is created on the knowledge base’s underlying vector store. This is essentially a database index created on the vector database.
Note that having an index on the knowledge base improves the performance of querying it, while it may slow down insert operations. Therefore, it is recommended to insert bulk data into the knowledge base before creating an index.