A knowledge base is a batteries included RAG system that you can create and insert data into, as well as query as if it was a table.

-- Create a knowledge base using the OpenAI embedding model & Chroma world news index
CREATE KNOWLEDGE_BASE world_news_kb;

-- Add data into the knowledge base
INSERT INTO world_news_kb (SELECT id, text as content FROM web_news);

-- You can query the knowledge base as virtual table
SELECT * FROM world_news_kb WHERE content = "Best AI Startups in SF?" LIMIT 10;

Internally, it uses a vector store and an embedding model, by default it uses chroma and it will pick a suitable embedding model for the task, however, you can specify each of these if you want.

Here is a general syntax for creating a knowledge base:

CREATE KNOWLEDGE_BASE my_kb
[USING
   [model = embedding_model,]
   [storage = vector_database.storage_table;]]

Managing input columns

Knowledge base can accept optional columns parameters to define where id, content and metadata columns are located:

CREATE KNOWLEDGE_BASE my_kb
USING
   metadata_columns = ['date', 'creator'],  -- optional, if not set: no metadata columns
   content_columns = ['review'],  -- optional, if not set: all columns is content
   id_column='index'   -- optional, default: id

Step by step guide

Here is how you build a KNOWLEDGE BASE and specifying the embedding model and the vector store you want:

Create the embedding model

-- Set up embedding model that uses OpenAI
CREATE ML_ENGINE embedding FROM langchain_embedding;
CREATE MODEL embedding_model
PREDICT embeddings
USING
	engine = "embedding",
	class = "openai",
	openai_api_key = "your_api_key_here",
	input_columns = ["content"];

Connect the vector store

-- Connect to Chroma
CREATE DATABASE chroma_dev_local
WITH ENGINE = "chromadb",
PARAMETERS = {
	"persist_directory": "persist path here"
};

In this example, there is a table called world_news_with_ids that contains columns (id, content)

SELECT * FROM world_news_with_ids;

Now, we can create an index in the vector store, and insert one example point:

CREATE TABLE chroma_dev_local.world_news_index (
	SELECT content, embeddings FROM embedding_model
	WHERE content = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique sollicitudin nibh sit amet commodo nulla. Risus sed vulputate odio ut enim blandit volutpat. Suspendisse ultrices gravida dictum fusce ut placerat orci. Eget nulla facilisi etiam dignissim diam. Aenean euismod elementum nisi quis eleifend quam. Ac placerat vestibulum lectus mauris ultrices eros in. Sed turpis tincidunt id aliquet risus feugiat in ante metus. Pellentesque habitant morbi tristique senectus et netus. Imperdiet massa tincidunt nunc pulvinar sapien et ligula. Leo in vitae turpis massa sed elementum tempus egestas. Aliquam malesuada bibendum arcu vitae elementum curabitur. Sit amet tellus cras adipiscing. Enim ut tellus elementum sagittis vitae et leo. Donec pretium vulputate sapien nec."
);

Let’s see what we have in the vector store:

SELECT * FROM chroma_dev_local.world_news_index;

Glue it all together

Now, we can connect the embedding model and the vector index into one knowledge base:

-- Create a knowledge base using the OpenAI embedding model & Chroma world news index
CREATE KNOWLEDGE_BASE world_news_kb
USING
	model = embedding_model,
	storage = chroma_dev_local.world_news_index;

This is how we add content into the vector store every few seconds if there is new data:

-- Insert world news data into Knowledge Base every second, if there is new data
CREATE JOB keep_knowledge_base_up_to_date AS (

    INSERT INTO world_news_kb (
        SELECT id, text as content FROM world_news_with_ids WHERE id > LAST
    );

) every second;

Note: the query has a id > LAST. This allows the job, to only select new data every time it runs.

Query the knowledge base

-- Do a similarity search over the knowledge base
SELECT * FROM world_news_kb;
SELECT * FROM world_news_kb WHERE content = "YouTube" LIMIT 1;
SELECT * FROM world_news_kb WHERE content = "Canada and Google" LIMIT 1;