> ## Documentation Index
> Fetch the complete documentation index at: https://docs.mindsdb.com/llms.txt
> Use this file to discover all available pages before exploring further.

# How to Create Knowledge Bases

A knowledge base is an advanced system that organizes information based on semantic meaning rather than simple keyword matching. It integrates embedding models, reranking models, and vector stores to enable context-aware data retrieval.

## `CREATE KNOWLEDGE_BASE` Syntax

Here is the syntax for creating a knowledge base:

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-..."
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-..."
    },
    storage = my_vector_store.storage_table,
    metadata_columns = ['date', 'creator', ...],
    content_columns = ['review', 'content', ...],
    id_column = 'id';
```

Upon execution, it registers `my_kb` and associates the specified models and storage. `my_kb` is a unique identifier of the knowledge base within MindsDB.

Here is how to list all knowledge bases:

```sql theme={null}
SHOW KNOWLEDGE_BASES;
```

<Tip>
  Users can use the variables and the [`from_env()` function](/mindsdb_sql/functions/from_env) to pass parameters when creating knowledge bases.
</Tip>

<Note>
  As MindsDB stores objects, such as models or knowledge bases, inside [projects](/mindsdb_sql/sql/create/project), you can create a knowledge base inside a custom project.

  ```sql theme={null}
  CREATE PROJECT my_project;

  CREATE KNOWLEDGE_BASE my_project.my_kb
  USING
      ...
  ```
</Note>

### Supported LLMs

Below is the list of all language models supported for the `embedding_model` and `reranking_model` parameters.

#### `provider = 'openai'`

This provider is supported for both `embedding_model` and `reranking_model`.

<Note>
  Users can define the default embedding and reranking models from OpenAI in Settings of the MindsDB GUI.

  Furthermore, users can select `Custom OpenAI API` from the dropdown and use models from any OpenAI-compatible API.
</Note>

When choosing `openai` as the model provider, users should define the following model parameters.

* `model_name` stores the name of the OpenAI model to be used.
* `api_key` stores the OpenAI API key.

Learn more about the [OpenAI integration with MindsDB here](/integrations/ai-engines/openai).

#### `provider = 'openai_azure'`

This provider is supported for both `embedding_model` and `reranking_model`.

<Note>
  Users can define the default embedding and reranking models from Azure OpenAI in Settings of the MindsDB GUI.
</Note>

When choosing `openai_azure` as the model provider, users should define the following model parameters.

* `model_name` stores the name of the OpenAI model to be used.
* `api_key` stores the OpenAI API key.
* `base_url` stores the base URL of the Azure instance.
* `api_version` stores the version of the Azure instance.

<Tip>
  Users need to log in to their Azure OpenAI instance to retrieve all relevant parameter values. Next, click on `Explore Azure AI Foundry portal` and go to `Models + endpoints`. Select the model and copy the parameter values.
</Tip>

#### `provider = 'google'`

This provider is supported for both `embedding_model` and `reranking_model`.

<Note>
  Users can define the default embedding and reranking models from Google in Settings of the MindsDB GUI.
</Note>

When choosing `google` as the model provider, users should define the following model parameters.

* `model_name` stores the name of the Google model to be used.
* `api_key` stores the Google API key.

Learn more about the [Google Gemini integration with MindsDB here](/integrations/ai-engines/google_gemini).

#### `provider = 'bedrock'`

This provider is supported for both `embedding_model` and `reranking_model`.

When choosing `bedrock` as the model provider, users should define the following model parameters.

* `model_name` stores the name of the model available via Amazon Bedrock.
* `aws_access_key_id` stores a unique identifier associated with your AWS account, used to identify the user or application making requests to AWS.
* `aws_region_name` stores the name of the AWS region you want to send your requests to (e.g., `"us-west-2"`).
* `aws_secret_access_key` stores the secret key associated with your AWS access key ID. It is used to sign your requests securely.
* `aws_session_token` is an optional parameter that stores a temporary token used for short-term security credentials when using AWS Identity and Access Management (IAM) roles or temporary credentials.

#### `provider = 'snowflake'`

This provider is supported for both `embedding_model` and `reranking_model`.

When choosing `snowflake` as the model provider, users should choose one of the available models from [Snowflake Cortex AI](https://www.snowflake.com/en/product/features/cortex/) and define the following model parameters.

* `model_name` stores the name of the model available via Snowflake Cortex AI.
* `api_key` stores the Snowflake Cortex AI API key.
* `account_id` stores the Snowflake account ID.

<Accordion title="How to Generate the API key of Snowflake Cortex AI">
  Follow the below steps to generate the API key.

  1. Generate a key pair according to [this instruction](https://docs.snowflake.com/en/user-guide/key-pair-auth) as below.

     * Execute these commands in the console:

       ```bash theme={null}
       # generate private key
       openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
       # generate public key
       openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
       ```

     * Save the public key, that is, the content of rsa\_key.pub, into your database user:

       ```sql theme={null}
       ALTER USER my_user SET RSA_PUBLIC_KEY = "<content of rsa_key.pub>"
       ```

  2. Verify the key pair with the database user.

     * Install `snowsql` following [this instruction](https://docs.snowflake.com/en/user-guide/snowsql-install-config).

     * Execute this command in the console:

       ```bash theme={null}
       snowsql -a <snowflake-account-id> -u my_user --private-key-path rsa_key.p8
       ```

  3. Generate JWT token.

     * Download the Python script from [Snowflake's Developer Guide for Authentication](https://docs.snowflake.com/en/developer-guide/sql-api/authenticating). Here is a [direct download link](https://docs.snowflake.com/en/_downloads/aeb84cdfe91dcfbd889465403b875515/sql-api-generate-jwt.py).

     * Ensure to have the PyJWT module installed that is required for running the script.

     * Run the script using this command:

       ```bash theme={null}
       sql-api-generate-jwt.py --account <snowflake-account-id> --user my_user --private_key_file_path rsa_key.p8
       ```

       This command returns the JWT token, which is used in the `api_key` parameter for the `snowflake` provider.
</Accordion>

#### `provider = 'ollama'`

This provider is supported for both `embedding_model` and `reranking_model`.

<Note>
  Users can define the default embedding and reranking models from Ollama in Settings of the MindsDB GUI.
</Note>

When choosing `ollama` as the model provider, users should define the following model parameters.

* `model_name` stores the name of the model to be used.
* `base_url` stores the base URL of the Ollama instance.

### `embedding_model`

The embedding model is a required component of the knowledge base. It stores specifications of the embedding model to be used.

Users can define the embedding model choosing one of the following options.

**Option 1.** Use the `embedding_model` parameter to define the specification.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    embedding_model = {

        "provider": "azure_openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01"

    },
    ...
```

**Option 2.** Define the default embedding model in the [MindsDB configuration file](/setup/custom-config).

<Tip>
  You can define the default models in the Settings of the MindsDB Editor GUI.
</Tip>

<Note>
  Note that if you define [`default_embedding_model` in the configuration file](/setup/custom-config#default_embedding_model), you do not need to provide the `embedding_model` parameter when creating a knowledge base. If provide both, then the values from the `embedding_model` parameter are used.

  When using `default_embedding_model` from the configuration file, the knowledge base saves this model internally. Therefore, when changing `default_embedding_model` in the configuration file to a different one after the knowledge base is created, it does not affect the already created knowledge bases.
</Note>

```bash theme={null}
"default_embedding_model": {

   "provider": "azure_openai",
   "model_name" : "text-embedding-3-large",
   "api_key": "sk-abc123",
   "base_url": "https://ai-6689.openai.azure.com/",
   "api_version": "2024-02-01"

}
```

The embedding model specification includes:

* `provider`
  It is a required parameter. It defines the model provider.

* `model_name`
  It is a required parameter. It defines the embedding model name as specified by the provider.

* `api_key`
  The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this `api_key` parameter, or in the `OPENAI_API_KEY` environment variable for `"provider": "openai"` and `AZURE_OPENAI_API_KEY` environment variable for `"provider": "azure_openai"`.

* `base_url`
  It is an optional parameter, which defaults to `https://api.openai.com/v1/`. It is a required parameter when using the `azure_openai` provider. It is the root URL used to send API requests.

* `api_version`
  It is an optional parameter. It is a required parameter when using the `azure_openai` provider. It defines the API version.

### `reranking_model`

The reranking model is an optional component of the knowledge base. It stores specifications of the reranking model to be used.

Users can disable reranking features of knowledge bases by setting this parameter to `false`.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    reranking_model = false,
    ...
```

Users can enable reranking features of knowledge bases by defining the reranking model choosing one of the following options.

**Option 1.** Use the `reranking_model` parameter to define the specification.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    reranking_model = {

        "provider": "azure_openai",
        "model_name" : "gpt-4o",
        "api_key": "sk-abc123",
        "base_url": "https://ai-6689.openai.azure.com/",
        "api_version": "2024-02-01",
        "method": "multi-class"

    },
    ...
```

**Option 2.** Define the default reranking model in the [MindsDB configuration file](/setup/custom-config).

<Tip>
  You can define the default models in the Settings of the MindsDB Editor GUI.
</Tip>

<Note>
  Note that if you define [`default_reranking_model` in the configuration file](/setup/custom-config#default-reranking-model), you do not need to provide the `reranking_model` parameter when creating a knowledge base. If provide both, then the values from the `reranking_model` parameter are used.

  When using `default_reranking_model` from the configuration file, the knowledge base saves this model internally. Therefore, when changing `default_reranking_model` in the configuration file to a different one after the knowledge base is created, it does not affect the already created knowledge bases.
</Note>

```bash theme={null}
"default_reranking_model": {

  "provider": "azure_openai",
  "model_name" : "gpt-4o",
  "api_key": "sk-abc123",
  "base_url": "https://ai-6689.openai.azure.com/",
  "api_version": "2024-02-01",
  "method": "multi-class"

}
```

The reranking model specification includes:

* `provider`
  It is a required parameter. It defines the model provider as listed in [supported LLMs](/mindsdb_sql/knowledge_bases/create#supported-llms).

* `model_name`
  It is a required parameter. It defines the embedding model name as specified by the provider.

* `api_key`
  The API key is required to access the embedding model assigned to a knowledge base. Users can provide it either in this `api_key` parameter, or in the `OPENAI_API_KEY` environment variable for `"provider": "openai"` and `AZURE_OPENAI_API_KEY` environment variable for `"provider": "azure_openai"`.

* `base_url`
  It is an optional parameter, which defaults to `https://api.openai.com/v1/`. It is a required parameter when using the `azure_openai` provider. It is the root URL used to send API requests.

* `api_version`
  It is an optional parameter. It is a required parameter when using the `azure_openai` provider. It defines the API version.

* `method`
  It is an optional parameter. It defines the method used to calculate the relevance of the output rows. The available options include `multi-class` and `binary`. It defaults to `multi-class`.

<Info>
  **Reranking Method**

  The `multi-class` reranking method classifies each document chunk (that meets any specified metadata filtering conditions) into one of four relevance classes:

  1. Not relevant with class weight of 0.25.
  2. Slightly relevant with class weight of 0.5.
  3. Moderately relevant with class weight of 0.75.
  4. Highly relevant with class weight of 1.

  The overall `relevance_score` of a document is calculated as the sum of each chunk’s class weight multiplied by its class probability (from model logprob output).

  The `binary` reranking method simplifies classification by determining whether a document is relevant or not, without intermediate relevance levels. With this method, the overall `relevance_score` of a document is calculated based on the model log probability.
</Info>

### `storage`

The vector store is a required component of the knowledge base. It stores data in the form of embeddings.

<Note>
  When using [MindsDB via Docker Desktop Extension](/setup/self-hosted/docker-desktop), the `storage` parameter is optional. The default storage is PGVector that is integrated into the MindsDB Docker Desktop Extension.

  When using [MindsDB via PyPI](/contribute/install) or [MindsDB via Docker image](/setup/self-hosted/docker), the `storage` parameter is required. Users need to connect their vector storage, such as PGVector, using the `CREATE DATABASE` command and then use it for the storage of the knowledge bases.
</Note>

The recommended vector store for knowledge bases is [PGVector](/integrations/vector-db-integrations/pgvector), specifically, version 0.8.0 or higher for a better performance.

In order to provide the storage vector database, it is required to connect it to MindsDB beforehand.

Here is an example for [PGVector](/integrations/vector-db-integrations/pgvector).

```sql theme={null}
CREATE DATABASE my_pgvector
WITH ENGINE = 'pgvector',
PARAMETERS = {
    "host": "127.0.0.1",
    "port": 5432,
    "database": "postgres",
    "user": "user",
    "password": "password",
    "distance": "cosine"
};

CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    storage = my_pgvector.storage_table,
    ...
```

<Info>
  Note that you do not need to have the `storage_table` created as it is created when creating a knowledge base.
</Info>

### `metadata_columns`

The data inserted into the knowledge base can be classified as metadata, which enables users to filter the search results using defined data fields.

<Note>
  Note that source data column(s) included in `metadata_columns` cannot be used in `content_columns`, and vice versa.
</Note>

This parameter is an array of strings that lists column names from the source data to be used as metadata. If not provided, then all inserted columns (except for columns defined as `id_column` and `content_columns`) are considered metadata columns.

Here is an example of usage. A user wants to store the following data in a knowledge base.

```sql theme={null}
+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+
```

<Tip>
  Go to the *Complete Example* section below to find out how to access this sample data.
</Tip>

The `product` column can be used as metadata to enable metadata filtering.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    metadata_columns = ['product'],
    ...
```

### `content_columns`

The data inserted into the knowledge base can be classified as content, which is embedded by the embedding model and stored in the underlying vector store.

<Note>
  Note that source data column(s) included in `content_columns` cannot be used in `metadata_columns`, and vice versa.
</Note>

This parameter is an array of strings that lists column names from the source data to be used as content and processed into embeddings. If not provided, the `content` column is expected by default when inserting data into the knowledge base.

Here is an example of usage. A user wants to store the following data in a knowledge base.

```sql theme={null}
+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+
```

<Tip>
  Go to the *Complete Example* section below to find out how to access this sample data.
</Tip>

The `notes` column can be used as content.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    content_columns = ['notes'],
    ...
```

### `id_column`

The ID column uniquely identifies each source data row in the knowledge base.

It is an optional parameter. If provided, this parameter is a string that contains the source data ID column name. If not provided, it is generated from the hash of the content columns.

Here is an example of usage. A user wants to store the following data in a knowledge base.

```sql theme={null}
+----------+-------------------+------------------------+
| order_id | product           | notes                  |
+----------+-------------------+------------------------+
| A1B      | Wireless Mouse    | Request color: black   |
| 3XZ      | Bluetooth Speaker | Gift wrap requested    |
| Q7P      | Laptop Stand      | Prefer aluminum finish |
+----------+-------------------+------------------------+
```

<Tip>
  Go to the *Complete Example* section below to find out how to access this sample data.
</Tip>

The `order_id` column can be used as ID.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    ...
    id_column = 'order_id',
    ...
```

<Info>
  Note that if the source data row is chunked into multiple chunks by the knowledge base (that is, to optimize the storage), then these rows in the knowledge base have the same ID value that identifies chunks from one source data row.
</Info>

<Note>
  **Available options for the ID column values**

  * User-Defined ID Column: <br />
    When users defined the `id_column` parameter, the values from the provided source data column are used to identify source data rows within the knowledge base.

  * User-Generated ID Column: <br />
    When users do not have a column that uniquely identifies each row in their source data, they can generate the ID column values when inserting data into the knowledge base using functions like `HASH()` or `ROW_NUMBER()`.

  ```sql theme={null}
  INSERT INTO my_kb (
      SELECT ROW_NUMBER() OVER (ORDER BY order_id) AS id, *
      FROM sample_data.orders
  );
  ```

  * Default ID Column: <br />
    If the `id_column` parameter is not defined, its default values are build from the hash of the content columns and follow the format: `<first 16 char of md5 hash of row content>`.
</Note>

### Example

Here is a sample knowledge base that will be used for examples in the following content.

```sql theme={null}
CREATE KNOWLEDGE_BASE my_kb
USING
    embedding_model = {
        "provider": "openai",
        "model_name" : "text-embedding-3-large",
        "api_key": "sk-abc123"
    },
    reranking_model = {
        "provider": "openai",
        "model_name": "gpt-4o",
        "api_key": "sk-abc123"
    },
    metadata_columns = ['product'],
    content_columns = ['notes'],
    id_column = 'order_id';
```

## `DESCRIBE KNOWLEDGE_BASE` Syntax

Users can get details about the knowledge base using the `DESCRIBE KNOWLEDGE_BASE` command.

```sql theme={null}
DESCRIBE KNOWLEDGE_BASE my_kb;
```

Here is the sample output:

```sql theme={null}
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
| NAME    | PROJECT | MODEL  | STORAGE                                | PARAMS                                                                                                                                                                                                                                       | INSERT_STARTED_AT | INSERT_FINISHED_AT | PROCESSED_ROWS | ERROR | QUERY_ID |
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
| my_kb   | mindsdb | [NULL] | my_kb_chromadb.default_collection      | {"embedding_model": {"provider": "openai", "model_name": "text-embedding-ada-002", "api_key": "sk-xxx"}, "reranking_model": {"provider": "openai", "model_name": "gpt-4o", "api_key": "sk-xxx"}, "default_vector_storage": "my_kb_chromadb"} | [NULL]            | [NULL]             | [NULL]         | [NULL]| [NULL]   |
+---------+---------+--------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+----------------+-------+----------+
```

## `DROP KNOWLEDGE_BASE` Syntax

Here is the syntax for deleting a knowledge base:

```sql theme={null}
DROP KNOWLEDGE_BASE my_kb;
```

Upon execution, it removes the knowledge base with its content.

Upon execution, it identifies matching records based on the user-defined condition and removes all associated data (metadata, content, chunks, embeddings) for matching records from the KB's storage.
