MindsDB provides the TO_MARKDOWN() function, which lets users extract the content of their documents as Markdown by specifying the document path or URL. This function is especially useful for passing the extracted content of documents to LLMs or for storing it in a Knowledge Base.

Configuration

The TO_MARKDOWN() function supports several file formats and several ways of passing documents into it. It also requires an LLM to process the documents.

Supported File Formats

The TO_MARKDOWN() function supports PDF, XML, and Nessus file formats. The documents can be provided from URLs, file storage, or Amazon S3 storage.

Supported LLMs

The TO_MARKDOWN() function requires an LLM to process the document content into the Markdown format. The supported LLM providers include:
  • OpenAI
  • Azure OpenAI
  • Google
The model you select must support multi-modal inputs, that is, both images and text. For example, OpenAI’s gpt-4o is a supported multi-modal model.
Users can provide an LLM using one of the following methods:
  1. Set the default model in the Settings of MindsDB Editor.
  2. Set the default model in the MindsDB configuration file.
  3. Use the environment variables defined below to set an LLM specifically for the TO_MARKDOWN() function. The TO_MARKDOWN_FUNCTION_PROVIDER environment variable defines the selected provider, which is one of openai, azure_openai, or google.
    Here are the environment variables for the OpenAI provider:
    TO_MARKDOWN_FUNCTION_API_KEY (required)
    TO_MARKDOWN_FUNCTION_MODEL_NAME
    TO_MARKDOWN_FUNCTION_TEMPERATURE
    TO_MARKDOWN_FUNCTION_MAX_RETRIES
    TO_MARKDOWN_FUNCTION_MAX_TOKENS
    TO_MARKDOWN_FUNCTION_BASE_URL
    TO_MARKDOWN_FUNCTION_API_ORGANIZATION
    TO_MARKDOWN_FUNCTION_REQUEST_TIMEOUT
    
    Here are the environment variables for the Azure OpenAI provider:
    TO_MARKDOWN_FUNCTION_API_KEY (required)
    TO_MARKDOWN_FUNCTION_BASE_URL (required)
    TO_MARKDOWN_FUNCTION_API_VERSION (required)
    TO_MARKDOWN_FUNCTION_MODEL_NAME
    TO_MARKDOWN_FUNCTION_TEMPERATURE
    TO_MARKDOWN_FUNCTION_MAX_RETRIES
    TO_MARKDOWN_FUNCTION_MAX_TOKENS
    TO_MARKDOWN_FUNCTION_API_ORGANIZATION
    TO_MARKDOWN_FUNCTION_REQUEST_TIMEOUT
    
    Here are the environment variables for the Google provider:
    TO_MARKDOWN_FUNCTION_API_KEY
    TO_MARKDOWN_FUNCTION_MODEL_NAME
    TO_MARKDOWN_FUNCTION_TEMPERATURE
    TO_MARKDOWN_FUNCTION_MAX_TOKENS
    TO_MARKDOWN_FUNCTION_REQUEST_TIMEOUT
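
The variable lists above can be checked before starting MindsDB. The following is a minimal Python sketch, not part of MindsDB: missing_to_markdown_vars is a hypothetical helper, and the required variables are copied from the lists above (the Google list marks none as required). The key value in the example is a placeholder.

```python
# Required variables per provider, copied from the lists above.
# Assumption: a variable is required only where the list says "(required)".
REQUIRED_VARS = {
    "openai": ["TO_MARKDOWN_FUNCTION_API_KEY"],
    "azure_openai": [
        "TO_MARKDOWN_FUNCTION_API_KEY",
        "TO_MARKDOWN_FUNCTION_BASE_URL",
        "TO_MARKDOWN_FUNCTION_API_VERSION",
    ],
    "google": [],
}


def missing_to_markdown_vars(env):
    """Return required variable names that are unset or empty (hypothetical helper)."""
    provider = env.get("TO_MARKDOWN_FUNCTION_PROVIDER", "")
    if provider not in REQUIRED_VARS:
        raise ValueError(f"unsupported provider: {provider!r}")
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


# Example with an incomplete Azure OpenAI configuration (placeholder key).
example_env = {
    "TO_MARKDOWN_FUNCTION_PROVIDER": "azure_openai",
    "TO_MARKDOWN_FUNCTION_API_KEY": "placeholder-key",
}
print(missing_to_markdown_vars(example_env))
# prints ['TO_MARKDOWN_FUNCTION_BASE_URL', 'TO_MARKDOWN_FUNCTION_API_VERSION']
```

To check the real environment, pass os.environ (after importing os) instead of example_env.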
    

Usage

You can use the TO_MARKDOWN() function to extract the content of your documents in Markdown format. The function takes one argument:
  • file_path_or_url: The path or URL of the document you want to extract content from.
The following example shows how to use the TO_MARKDOWN() function with a PDF document from Amazon S3 storage connected to MindsDB.
SELECT TO_MARKDOWN(public_url) FROM s3_datasource.files;
Here are the steps for passing files from Amazon S3 into TO_MARKDOWN().
  1. Connect Amazon S3 to MindsDB by following the Amazon S3 integration instructions.
  2. The public_url of the file is generated in the s3_datasource.files table upon connecting the Amazon S3 data source to MindsDB.
  3. Upon running the above query, the public_url of the file is selected from the s3_datasource.files table.
The following example shows how to use the TO_MARKDOWN() function with a PDF document from a URL.
SELECT TO_MARKDOWN('https://www.princexml.com/howcome/2016/samples/invoice/index.pdf');
Here is the output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| to_markdown                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ```markdown                                                                                                                                                                                                                                  |
| # Invoice                                                                                                                                                                                                                                    |
|                                                                                                                                                                                                                                              |
| YesLogic Pty. Ltd.                                                                                                                                                                                                                           |
| 7 / 39 Bouverie St                                                                                                                                                                                                                           |
| Carlton VIC 3053                                                                                                                                                                                                                             |
| Australia                                                                                                                                                                                                                                    |
|                                                                                                                                                                                                                                              |
| www.yeslogic.com                                                                                                                                                                                                                             |
| ABN 32 101 193 560                                                                                                                                                                                                                           |
|                                                                                                                                                                                                                                              |
| Customer Name                                                                                                                                                                                                                                |
| Street                                                                                                                                                                                                                                       |
| Postcode City                                                                                                                                                                                                                                |
| Country                                                                                                                                                                                                                                      |
|                                                                                                                                                                                                                                              |
| Invoice date: | Nov 26, 2016                                                                                                                                                                                                                 |
| --- | ---                                                                                                                                                                                                                                    |
| Invoice number: | 161126                                                                                                                                                                                                                     |
| Payment due: | 30 days after invoice date                                                                                                                                                                                                    |
|                                                                                                                                                                                                                                              |
| | Description               | From        | Until       | Amount     |                                                                                                                                                                       |
| |---------------------------|-------------|-------------|------------|                                                                                                                                                                       |
| | Prince Upgrades & Support | Nov 26, 2016 | Nov 26, 2017 | USD $950.00 |                                                                                                                                                                    |
| | Total                     |             |             | USD $950.00 |                                                                                                                                                                      |
|                                                                                                                                                                                                                                              |
| Please transfer amount to:                                                                                                                                                                                                                   |
|                                                                                                                                                                                                                                              |
| Bank account name: | Yes Logic Pty Ltd                                                                                                                                                                                                       |
| --- | ---                                                                                                                                                                                                                                    |
| Name of Bank: | Commonwealth Bank of Australia (CBA)                                                                                                                                                                                         |
| Bank State Branch (BSB): | 063010                                                                                                                                                                                                            |
| Bank State Branch (BSB): | 063010                                                                                                                                                                                                            |
| Bank State Branch (BSB): | 063019                                                                                                                                                                                                            |
| Bank account number: | 13201652                                                                                                                                                                                                              |
| Bank SWIFT code: | CTBAAU2S                                                                                                                                                                                                                  |
| Bank address: | 231 Swanston St, Melbourne, VIC 3000, Australia                                                                                                                                                                              |
|                                                                                                                                                                                                                                              |
| The BSB number identifies a branch of a financial institution in Australia. When transferring money to Australia, the BSB number is used together with the bank account number and the SWIFT code. Australian banks do not use IBAN numbers. |
|                                                                                                                                                                                                                                              |
| www.yeslogic.com                                                                                                                                                                                                                             |
| ```                                                                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
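In the sample output above, the extracted content is wrapped in a markdown code fence. If that wrapper is unwanted downstream, it can be removed on the client side. The following is a minimal Python sketch; strip_markdown_fence is a hypothetical helper, not a MindsDB function, and it assumes the wrapper is exactly one opening fence line and one closing fence line, as in the sample.

```python
# Three backticks, built this way so the example itself stays easy to embed.
FENCE = "`" * 3


def strip_markdown_fence(text):
    """Drop the first and last lines when they form a code fence (hypothetical helper)."""
    lines = text.strip().splitlines()
    has_fence = (
        len(lines) >= 2
        and lines[0].startswith(FENCE)     # opening fence, e.g. the fence plus "markdown"
        and lines[-1].strip() == FENCE     # closing fence on its own line
    )
    if has_fence:
        lines = lines[1:-1]
    return "\n".join(lines)


wrapped = FENCE + "markdown\n# Invoice\n\nYesLogic Pty. Ltd.\n" + FENCE
print(strip_markdown_fence(wrapped))
```

Text that is not wrapped in a fence is returned unchanged apart from leading and trailing whitespace.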
The content of each PDF page is extracted by first assessing how visually complex the page is. Based on this assessment, the system decides whether traditional text parsing is sufficient or whether the page should be processed using an LLM.
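The per-page decision can be pictured with a small Python sketch. The criteria MindsDB actually uses are not described here, so everything in the block is an assumption made for illustration: the PageStats fields, the thresholds, and the choose_extraction_method name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements; not MindsDB's actual inputs."""
    text_chars: int   # characters recovered by plain text parsing
    image_count: int  # embedded images on the page
    table_count: int  # detected tables on the page


def choose_extraction_method(page, max_visual_elements=0, min_text_chars=200):
    """Return "text_parser" or "llm" for one page (illustrative thresholds)."""
    visual_elements = page.image_count + page.table_count
    if visual_elements > max_visual_elements:
        return "llm"  # images or tables present: treat the page as visually complex
    if page.text_chars < min_text_chars:
        return "llm"  # very little extractable text, for example a scanned page
    return "text_parser"  # plain text page: traditional parsing is enough


print(choose_extraction_method(PageStats(text_chars=1500, image_count=0, table_count=0)))
print(choose_extraction_method(PageStats(text_chars=1500, image_count=2, table_count=1)))
```

The point of the sketch is the routing shape: cheap text parsing is the default, and the LLM is used only for pages that a simple parser would likely handle poorly.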

Usage with Knowledge Bases

You can also use the TO_MARKDOWN() function to extract content from documents and store it in a Knowledge Base. This is particularly useful for creating a Knowledge Base from a collection of documents.
INSERT INTO my_kb (
  SELECT
    HASH('https://www.princexml.com/howcome/2016/samples/invoice/index.pdf') as id,
    TO_MARKDOWN('https://www.princexml.com/howcome/2016/samples/invoice/index.pdf') as content
);
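
When several documents need to be added, the statement above can be generated once per URL. The following is a minimal Python sketch; build_kb_insert is a hypothetical helper, not part of MindsDB, and it only builds the SQL text without contacting a server. The my_kb name and the URL are taken from the example above.

```python
def build_kb_insert(kb_name, url):
    """Build the INSERT statement shown above for one document URL.

    Hypothetical helper: it only produces SQL text and does not contact MindsDB.
    """
    # Escaping rules for string literals are not covered here, so URLs that
    # contain a single quote are rejected instead of being escaped.
    if "'" in url:
        raise ValueError("URL must not contain a single quote")
    return (
        f"INSERT INTO {kb_name} (\n"
        f"  SELECT\n"
        f"    HASH('{url}') AS id,\n"
        f"    TO_MARKDOWN('{url}') AS content\n"
        f");"
    )


urls = ["https://www.princexml.com/howcome/2016/samples/invoice/index.pdf"]
statements = [build_kb_insert("my_kb", url) for url in urls]
print(statements[0])
```

Each generated statement can then be run in the MindsDB Editor or through any client that submits SQL to MindsDB.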