MindsDB provides the TO_MARKDOWN() function, which extracts the content of a document as Markdown given only the document's path or URL. This function is especially useful for passing document content to LLMs or for storing it in a Knowledge Base.

Configuration

The TO_MARKDOWN() function accepts several file formats and document sources, and it requires an LLM to process documents.

Supported File Formats

The TO_MARKDOWN() function supports PDF, XML, and Nessus file formats. The documents can be provided from URLs, file storage, or Amazon S3 storage.

Supported LLMs

The TO_MARKDOWN() function requires an LLM to convert document content into Markdown. The supported LLM providers are:
  • OpenAI
  • Azure OpenAI
  • Google
The model you select must support multi-modal inputs, that is, both images and text. For example, OpenAI’s gpt-4o is a supported multi-modal model.
You can provide an LLM using one of the following methods:
  1. Set the default model in the Settings of MindsDB Editor.
  2. Set the default model in the MindsDB configuration file.
  3. Use environment variables to set an LLM specifically for the TO_MARKDOWN() function. The TO_MARKDOWN_FUNCTION_PROVIDER environment variable selects the provider and must be one of openai, azure_openai, or google.

Usage

You can use the TO_MARKDOWN() function to extract the content of your documents in Markdown format. The function takes one argument:
  • file_path_or_url: The path or URL of the document you want to extract content from.
For PDF documents, each page is first assessed for visual complexity. Based on this assessment, the system decides whether traditional text parsing is sufficient or whether the page should be processed with an LLM.
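A standalone call might look like the sketch below. It assumes an LLM has already been configured through one of the methods listed under Supported LLMs, and it reuses the sample PDF URL from the Knowledge Base example on this page; substitute the path or URL of your own document.

```sql
-- Extract the content of a PDF at a public URL and return it as Markdown.
-- The URL is only a sample; any supported document path or URL can be used.
SELECT TO_MARKDOWN('https://www.princexml.com/howcome/2016/samples/invoice/index.pdf') AS content;
```

The result is a single column, aliased here as content, holding the extracted Markdown text.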

Usage with Knowledge Bases

You can also use the TO_MARKDOWN() function to extract content from documents and store it in a Knowledge Base. This is particularly useful for creating a Knowledge Base from a collection of documents. The following statement inserts one document, using a hash of its URL as the row ID:
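-- Insert one document into the Knowledge Base: id is a hash of the URL, content is the extracted Markdown.
INSERT INTO my_kb (
  SELECT
    HASH('https://www.princexml.com/howcome/2016/samples/invoice/index.pdf') AS id,
    TO_MARKDOWN('https://www.princexml.com/howcome/2016/samples/invoice/index.pdf') AS content
);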
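Once the document is stored, its content can be searched like any other Knowledge Base data. The query below is a rough sketch: it assumes my_kb was created beforehand with a working embedding model, and the search phrase is purely illustrative.

```sql
-- Semantic search over the stored Markdown content.
-- 'invoice total' is a hypothetical search phrase, not taken from the sample document.
SELECT *
FROM my_kb
WHERE content = 'invoice total'
LIMIT 5;
```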