In this section, we present how to use a web crawler within MindsDB.
A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. The primary purpose of a web crawler is to index and catalog information from the web, allowing search engines to provide relevant search results to users.
This web crawler tool can be utilized within MindsDB to fetch data for training AI models and chatbots.
This handler does not require any connection parameters.
Here is how to initialize a web crawler:
```sql
CREATE DATABASE my_web
WITH ENGINE = 'web';
```
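To confirm that the connection was created, you can list the available databases with standard MindsDB syntax; the new `my_web` entry should appear in the output:

```sql
SHOW DATABASES;
```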
If you installed MindsDB locally via pip, you need to install all handler dependencies manually. To do so, go to the handler’s folder (mindsdb/integrations/handlers/web_handler) and run this command:
```shell
pip install -r requirements.txt
```
Get Websites Content
Here is how to get the content of a website:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```
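The exact columns returned by the crawler depend on the handler version; assuming the result includes a `text_content` column (run a `SELECT *` query first to inspect the actual result set), you can project only the fields you need:

```sql
-- Note: the column names below are assumptions; verify them against
-- the output of SELECT * before relying on them.
SELECT url, text_content
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```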
You can also get the content of a website's subpages. The `LIMIT` clause controls how many pages are crawled. Here is how to fetch the content of up to 10 pages:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```
Another option is to get the content from multiple websites:

```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
```
Get PDF Content
MindsDB accepts file uploads in several formats, including parquet. However, you can also utilize the web crawler to fetch data from PDF files:
```sql
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```
For example, you can provide a link to a PDF file that is publicly accessible via a URL.
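As mentioned above, crawled content can feed into AI models and chatbots. The following is a sketch of that workflow, not a definitive recipe: the model name, prompt, and `prompt_template` value are hypothetical, and it assumes an OpenAI engine is already configured in your MindsDB instance:

```sql
-- Hypothetical example: requires a configured OpenAI engine in MindsDB.
CREATE MODEL mindsdb.web_summarizer
PREDICT summary
USING
    engine = 'openai',
    prompt_template = 'Summarize the following content: {{text_content}}';

-- Join crawled pages with the model to summarize their content.
SELECT t.url, m.summary
FROM my_web.crawler AS t
JOIN mindsdb.web_summarizer AS m
WHERE t.url = 'docs.mindsdb.com'
LIMIT 1;
```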