Web Crawler
This section describes how to use a web crawler within MindsDB.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.
Prerequisites
Before proceeding, ensure the following prerequisites are met:
- Install MindsDB locally via Docker or Docker Desktop.
- To use the Web Crawler with MindsDB, install the required dependencies following these instructions.
Connection
This handler does not require any connection parameters.
Here is how to initialize a web crawler:
CREATE DATABASE my_web
WITH ENGINE = 'web';
The above query creates a database called my_web. By default, this database contains a table called crawler that can be used to crawl data from a given URL or URLs.
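To confirm that the connection was created, you can list the tables exposed by the new database. This is a minimal sketch assuming standard MindsDB SQL syntax:

```sql
-- List tables available in the newly created web database;
-- the crawler table should appear in the output.
SHOW TABLES FROM my_web;
```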
Usage
Specifying a LIMIT clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
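For example, to attempt a full-site crawl, set the limit well above the site's expected page count. The query below follows the pattern described above; the bound of 10,000 is an arbitrary illustrative value:

```sql
-- Crawl up to 10,000 pages starting from the given URL.
-- The limit is chosen to exceed the expected number of pages on the site.
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10000;
```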
Get Websites Content
The following usage examples demonstrate how to retrieve content from docs.mindsdb.com:
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
You can also retrieve content from internal pages. The following query fetches content from up to 10 internal pages:
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
Another option is to get the content from multiple websites by using the IN () operator:
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
Get PDF Content
MindsDB accepts file uploads of csv, xlsx, xls, sheet, json, and parquet. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
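PDF URLs can likewise be combined with the IN () operator shown earlier. This is a sketch; the URLs below are placeholders for real, publicly accessible PDF files:

```sql
-- Fetch content from two PDF files in a single query.
-- Replace the placeholder URLs with links to real PDF files.
SELECT *
FROM my_web.crawler
WHERE url IN ('https://example.com/report-a.pdf', 'https://example.com/report-b.pdf')
LIMIT 2;
```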
Troubleshooting
Web crawler encounters character encoding issues
- Symptoms: Extracted text appears garbled or contains strange characters instead of the expected text.
- Checklist:
- Open a GitHub Issue: If you encounter a bug or a repeatable encoding error, report it by opening an issue on the MindsDB GitHub repository.
Web crawler times out while trying to fetch content
- Symptoms: The crawler fails to retrieve data from a website, resulting in timeout errors.
- Checklist:
- Check the network connection to ensure the target site is reachable.