In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. The primary purpose of a web crawler is to index and catalog information from the web, allowing search engines to provide relevant search results to users.

This web crawler tool can be utilized within MindsDB to fetch data for training AI models and chatbots.

Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

CREATE DATABASE my_web 
WITH ENGINE = 'web';

If you installed MindsDB locally via pip, you need to install all handler dependencies manually. To do so, go to the handler’s folder (mindsdb/integrations/handlers/web_handler) and run:

pip install -r requirements.txt

Usage

Get Website Content

Here is how to get the content of docs.mindsdb.com:

SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 1;

You can also crawl the child pages of a website. The LIMIT clause defines how many pages are fetched. Here is how to fetch the content of up to 10 pages of docs.mindsdb.com:

SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 10;

Another option is to get the content from multiple websites at once.

SELECT * 
FROM my_web.crawler 
WHERE url IN ('docs.mindsdb.com', 'docs.python.org') 
LIMIT 1;
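
As mentioned at the start of this section, crawled content can also feed AI models. The sketch below is illustrative only: it assumes an OpenAI engine is already configured in your MindsDB instance and that the crawler result includes a text_content column; the model name web_summarizer and the prompt template are hypothetical.

CREATE MODEL mindsdb.web_summarizer
PREDICT summary
USING
    engine = 'openai',
    prompt_template = 'Summarize the following content: {{text_content}}';

SELECT m.summary
FROM my_web.crawler AS d
JOIN mindsdb.web_summarizer AS m
WHERE d.url = 'docs.mindsdb.com'
LIMIT 1;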

Get PDF Content

MindsDB accepts file uploads in the csv, xlsx, xls, sheet, json, and parquet formats. However, you can use the web crawler to fetch data from PDF files.

SELECT * 
FROM my_web.crawler 
WHERE url = '<link-to-pdf-file>' 
LIMIT 1;

For example, you can provide a link to a PDF file stored in Amazon S3.
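
Such a query might look like the following; the bucket and file names below are placeholders, not real resources:

SELECT *
FROM my_web.crawler
WHERE url = 'https://my-bucket.s3.amazonaws.com/report.pdf'
LIMIT 1;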