A knowledge base is a batteries-included RAG system that you can create and insert data into, as well as query as if it were a table.

Internally, a knowledge base uses a vector store and an embedding model. By default, it uses the ChromaDB vector store and OpenAI's text-embedding-ada-002 embedding model, which requires an OpenAI API key set as the OPENAI_API_KEY environment variable. You can define your own vector store and embedding model as in the examples below.

Syntax

Here is a general syntax for creating a knowledge base:

CREATE KNOWLEDGE_BASE my_kb
USING
   model = embedding_model, -- optional
   storage = vector_database.storage_table, -- optional
   metadata_columns = ['date', 'creator', ...], -- optional
   content_columns = ['review', 'content', ...], -- optional
   id_column = 'index'; -- optional

Where all the parameters are optional:

  • model: This is an embedding model created within MindsDB with CREATE MODEL embedding_model. If you do not provide the model parameter, then OpenAI's text-embedding-ada-002 model is used by default, provided that the OPENAI_API_KEY environment variable is defined.
  • storage: This is a vector database that stores embedded data. It defaults to ChromaDB, or you can connect your vector store to MindsDB with CREATE DATABASE vector_database.
  • metadata_columns: The list of column names that will be stored in the metadata column of the knowledge base. If not set, the metadata column is not used.
  • content_columns: The list of column names that will be stored in the content column of the knowledge base. If not set, all columns are stored in the content column.
  • id_column: The column whose values are stored in the id column of the knowledge base to uniquely identify the data. If not set, the id values are generated automatically, as in the examples below.

Each knowledge base comprises the id, content, and metadata columns that store data defined in the parameters.
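For instance, here is a minimal sketch that relies on all the defaults (ChromaDB storage and OpenAI's text-embedding-ada-002 model, with the OPENAI_API_KEY environment variable set) and then inspects these columns; the name my_kb is illustrative:

-- Create a knowledge base using only the defaults.
CREATE KNOWLEDGE_BASE my_kb;

-- Inspect the standard columns of the knowledge base.
SELECT id, content, metadata
FROM my_kb;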

Examples

This section presents examples of how to create knowledge bases and insert data for storage in the form of embeddings.

Knowledge Base with OpenAI Embedding Model

Note that using OpenAI's embedding model requires an OpenAI API key.

First, create an engine through which the model is accessed.

CREATE ML_ENGINE embedding
FROM langchain_embedding;

Next, create an embedding model, providing an OpenAI API key.

CREATE MODEL embedding_model
PREDICT embeddings
USING
      engine = "embedding",
      class = "openai",
      model = "text-embedding-ada-002",
      openai_api_key = "sk-xxx",
      input_columns = ["content"];
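Optionally, you can verify the model by embedding a sample text directly; the sentence below is arbitrary:

-- Generate embeddings for a sample text to confirm the model works.
SELECT content, embeddings
FROM embedding_model
WHERE content = "This is a test sentence.";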

Analyze the data that you want to insert into a knowledge base:

SELECT *
FROM example_db.home_rentals;

Here is the output:

+-----------------+---------------------+------+----------+----------------+----------------+--------------+----------------------------+
| number_of_rooms | number_of_bathrooms | sqft | location | days_on_market | neighborhood   | rental_price | created_at                 |
+-----------------+---------------------+------+----------+----------------+----------------+--------------+----------------------------+
| 2               | 1                   | 917  | great    | 13             | berkeley_hills | 3901         | 2024-02-24 02:28:21.746167 |
| 0               | 1                   | 194  | great    | 10             | berkeley_hills | 2042         | 2024-02-19 06:10:59.693052 |
| 1               | 1                   | 543  | poor     | 18             | westbrae       | 1871         | 2024-02-12 07:53:45.914146 |
+-----------------+---------------------+------+----------+----------------+----------------+--------------+----------------------------+

Decide which columns to use as content and which ones as metadata. For example, we use the days_on_market and neighborhood columns as metadata and the location and rental_price columns as content.

Now that you have an embedding model, create a knowledge base, passing this embedding model and defining the content and metadata columns.

CREATE KNOWLEDGE_BASE kb_custom_model
USING
      model = embedding_model,
      metadata_columns = ['days_on_market', 'neighborhood'],
      content_columns = ['location', 'rental_price'];

After successful creation of a knowledge base, insert data to store it in the form of embeddings.

INSERT INTO kb_custom_model (
      SELECT *
      FROM example_db.home_rentals
);

Finally, you can verify that the data has been inserted into the knowledge base by querying it.

SELECT * FROM kb_custom_model;

Here is the output:

+----+--------------------+----------------------------------------------------------+
| id | content            | metadata                                                 |
+----+--------------------+----------------------------------------------------------+
| 1  | location: great    | {"days_on_market": 13, "neighborhood": "berkeley_hills"} |
|    | rental_price: 3901 |                                                          |
| 2  | location: great    | {"days_on_market": 10, "neighborhood": "berkeley_hills"} |
|    | rental_price: 2042 |                                                          |
| 3  | location: poor     | {"days_on_market": 18, "neighborhood": "westbrae"}       |
|    | rental_price: 1871 |                                                          |
+----+--------------------+----------------------------------------------------------+
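You can also retrieve the most relevant entries with a semantic search by filtering on the content column; the search phrase below is only an example:

-- Semantic search over the embedded content.
SELECT *
FROM kb_custom_model
WHERE content = "rentals in a great location"
LIMIT 1;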

Knowledge Base with Hugging Face Embedding Model

This example uses an open source embedding model.

First, create an engine through which the model is accessed.

CREATE ML_ENGINE embedding
FROM langchain_embedding;

Next, create an embedding model.

CREATE MODEL embedding_model
PREDICT embedding
USING
      engine = 'embedding',
      class = 'HuggingFaceEmbeddings',
      model_name = "sentence-transformers/all-mpnet-base-v2";

Now that you have an embedding model, create a knowledge base, passing this embedding model.

CREATE KNOWLEDGE_BASE kb_custom_model
USING
      model = embedding_model;

After successful creation of a knowledge base, insert data to store it in the form of embeddings.

INSERT INTO kb_custom_model (
      SELECT review AS content
      FROM example_db.amazon_reviews
);

Finally, you can verify that the data has been inserted into the knowledge base by querying it.

SELECT * FROM kb_custom_model;
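As before, you can retrieve reviews by meaning rather than exact keywords; the search phrase is illustrative:

-- Semantic search over the embedded reviews.
SELECT *
FROM kb_custom_model
WHERE content = "positive feedback about product quality"
LIMIT 1;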

Knowledge Base with Custom Vector Store

This example shows how to create a knowledge base with a custom vector database.

First, connect to your vector database. Here, it is ChromaDB.

CREATE DATABASE chroma_dev_local
WITH ENGINE = "chromadb",
PARAMETERS = {
	    "persist_directory": "persist path here"
};

Create an index in the vector store and insert one example point, using the embedding_model created in the earlier examples.

CREATE TABLE chroma_dev_local.world_news_index (
	SELECT content, embeddings FROM embedding_model
	WHERE content = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique sollicitudin nibh sit amet commodo nulla. Risus sed vulputate odio ut enim blandit volutpat. Suspendisse ultrices gravida dictum fusce ut placerat orci. Eget nulla facilisi etiam dignissim diam. Aenean euismod elementum nisi quis eleifend quam. Ac placerat vestibulum lectus mauris ultrices eros in. Sed turpis tincidunt id aliquet risus feugiat in ante metus. Pellentesque habitant morbi tristique senectus et netus. Imperdiet massa tincidunt nunc pulvinar sapien et ligula. Leo in vitae turpis massa sed elementum tempus egestas. Aliquam malesuada bibendum arcu vitae elementum curabitur. Sit amet tellus cras adipiscing. Enim ut tellus elementum sagittis vitae et leo. Donec pretium vulputate sapien nec."
);
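You can preview what was stored by selecting from the new table; this assumes the ChromaDB integration supports reading the collection back:

-- Preview the example point stored in the vector store.
SELECT *
FROM chroma_dev_local.world_news_index;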

Next, create a knowledge base, passing this vector database connection.

CREATE KNOWLEDGE_BASE world_news_kb
USING
      storage = chroma_dev_local.world_news_index;
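The storage parameter can also be combined with a custom embedding model; here, embedding_model is the model created in the earlier examples and the knowledge base name is illustrative:

-- Use both a custom embedding model and a custom vector store.
CREATE KNOWLEDGE_BASE world_news_kb_custom
USING
      model = embedding_model,
      storage = chroma_dev_local.world_news_index;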

Automate Data Sync with Knowledge Base

This example shows how to set up a job that inserts newly available data from your database into a knowledge base.

Here is how you can automate adding content to the knowledge base every time new data becomes available:

CREATE JOB keep_knowledge_base_up_to_date AS (

    INSERT INTO world_news_kb (
        SELECT id, text AS content
        FROM world_news_with_ids
        WHERE id > LAST
    );

) EVERY second;

The LAST keyword enables the query to select only the newly added data. Learn more about the LAST keyword here.
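When the sync is no longer needed, you can remove the job:

-- Stop the automated sync by dropping the job.
DROP JOB keep_knowledge_base_up_to_date;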

Query the knowledge base as below. Filtering on the content column performs a semantic search over the stored embeddings and returns the most relevant entries.

SELECT *
FROM world_news_kb;

SELECT *
FROM world_news_kb
WHERE content = "YouTube" LIMIT 1;

SELECT *
FROM world_news_kb
WHERE content = "Canada and Google" LIMIT 1;