Web Crawler
In this section, we present how to use a web crawler within MindsDB.
A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. The primary purpose of a web crawler is to index and catalog information from the web, allowing search engines to provide relevant search results to users.
This web crawler tool can be utilized within MindsDB to fetch data for training AI models and chatbots.
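Since crawled pages come back as rows, they can be joined with a model like any other table. Here is a hedged sketch of that pattern, assuming a model named my_model already exists in the mindsdb project and exposes an answer column (both names are illustrative, not part of the web handler):

```sql
-- Hypothetical: my_model and its answer column are assumptions for illustration.
SELECT t.url, m.answer
FROM my_web.crawler AS t
JOIN mindsdb.my_model AS m
WHERE t.url = 'docs.mindsdb.com'
LIMIT 1;
```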
Connection
This handler does not require any connection parameters.
Here is how to initialize a web crawler:
CREATE DATABASE my_web
WITH ENGINE = 'web';
If you installed MindsDB locally via pip, you need to install all handler dependencies manually. To do so, go to the handler's folder (mindsdb/integrations/handlers/web_handler) and run this command: pip install -r requirements.txt.
Usage
Get Websites Content
Here is how to get the content of docs.mindsdb.com:
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
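You can also select individual columns instead of all of them. The sketch below assumes the crawler result includes columns named url and text_content; check the columns returned by your MindsDB version before relying on these names:

```sql
-- Assumption: the crawler table exposes url and text_content columns.
SELECT url, text_content
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```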
You can also get the content of pages linked from a website. Here is how to fetch the content of up to 10 pages:
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
Another option is to get the content from multiple websites at once.
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
Get PDF Content
MindsDB accepts file uploads of csv, xlsx, xls, sheet, json, and parquet. However, you can utilize the web crawler to fetch data from pdf files.
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
For example, you can provide a link to a pdf file stored in Amazon S3.
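A concrete version of that query could look like the following; the bucket and object names are hypothetical placeholders, and the pdf must be publicly readable (or otherwise reachable by the MindsDB instance) for the crawler to fetch it:

```sql
-- Hypothetical S3 URL: replace bucket and key with your own.
SELECT *
FROM my_web.crawler
WHERE url = 'https://my-bucket.s3.amazonaws.com/reports/annual-report.pdf'
LIMIT 1;
```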