In this section, we present how to use a web crawler within MindsDB. A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.
Prerequisites
Before proceeding, ensure the following prerequisites are met:

- Install MindsDB locally via Docker or Docker Desktop.
- To use the web crawler with MindsDB, install the required dependencies following these instructions.
Connection
This handler does not require any connection parameters; you initialize a web crawler by creating a database that uses the web engine.

Usage
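A minimal initialization might look like the following; the database name `my_web` is an example and can be anything:

```sql
-- Create a virtual database backed by the web handler.
CREATE DATABASE my_web
WITH ENGINE = 'web';
```

This creates a virtual database whose `crawler` table you query with ordinary `SELECT` statements, as shown in the examples below.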
Parameters
Crawl Depth
The `crawl_depth` parameter defines how deep the crawler should navigate through linked pages:
- `crawl_depth = 0`: crawls only the specified page.
- `crawl_depth = 1`: crawls the specified page and all linked pages on it.
- Higher values continue the pattern.
Page Limits
There are multiple ways to limit the number of pages returned:

- The `LIMIT` clause defines the maximum number of pages returned globally.
- The `per_url_limit` parameter limits the number of pages returned for each specific URL, if more than one URL is provided.
Crawling a Single URL
The following example retrieves data from a single webpage, using the `LIMIT` clause to cap the number of returned rows:
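A sketch of such a query, assuming the handler was initialized as a database named `my_web` (the web handler exposes its results through a `crawler` table):

```sql
-- Fetch the content of one page; LIMIT 1 returns only that page.
SELECT * FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/'
LIMIT 1;
```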
Crawling Multiple URLs
To crawl multiple URLs at once, list the URLs with an `IN` clause in the `WHERE` condition.

Crawling with Depth
To crawl all pages linked within a website, set `crawl_depth = x`, where x is the depth of linked pages to follow.
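For instance, a depth-one crawl might look like this (assuming the `my_web` database from initialization; the URL is an example):

```sql
-- crawl_depth = 1 fetches the page plus every page linked from it.
SELECT * FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/'
AND crawl_depth = 1;
```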
For multiple URLs with crawl depth, x and y are the numbers of linked pages returned from each URL.
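A sketch of such a query, assuming the `my_web` database and example URLs; here `per_url_limit` caps how many pages each URL contributes, per the parameter definitions above:

```sql
-- Crawl two sites one level deep, returning at most 10 pages per URL.
SELECT * FROM my_web.crawler
WHERE url IN ('https://docs.mindsdb.com/', 'https://mindsdb.com/')
AND crawl_depth = 1
AND per_url_limit = 10;
```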
Get PDF Content
MindsDB accepts file uploads of csv, xlsx, xls, sheet, json, and parquet. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.
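Pointing the crawler at a PDF works the same way as crawling a regular page; the URL below is a placeholder:

```sql
-- The crawler extracts the text content of the PDF at the given URL.
SELECT * FROM my_web.crawler
WHERE url = 'https://example.com/sample-document.pdf'
LIMIT 1;
```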
Configuring Web Handler for Specific Domains
The Web Handler can be configured to interact only with specific domains by using the `web_crawling_allowed_sites` setting in the `config.json` file.
This feature allows you to restrict the handler to crawl and process content only from the domains you specify, enhancing security and control over web interactions.
To configure this, simply list the allowed domains under the `web_crawling_allowed_sites` key in `config.json`. For example:
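A sketch of the relevant `config.json` entry; the listed domains are examples:

```json
{
    "web_crawling_allowed_sites": [
        "https://docs.mindsdb.com",
        "https://example.com"
    ]
}
```

With this in place, the handler refuses to crawl URLs outside the listed domains.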