> ## Documentation Index
> Fetch the complete documentation index at: https://docs.mindsdb.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Web Crawler

In this section, we present how to use a web crawler within MindsDB.

A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

## Prerequisites

Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

```sql theme={null}
CREATE DATABASE my_web 
WITH ENGINE = 'web';
```

<Tip>
  The above query creates a database called `my_web`. This database by default has a table called `crawler` that stores data from a given URL or multiple URLs.
</Tip>

## Usage

### Parameters

#### Crawl Depth

The `crawl_depth` parameter defines how deep the crawler should navigate through linked pages:

* `crawl_depth = 0`: Crawls only the specified page.
* `crawl_depth = 1`: Crawls the specified page and all linked pages on it.
* Higher values continue the pattern.

#### Page Limits

There are multiple ways to limit the number of pages returned:

* The `LIMIT` clause defines the maximum number of pages returned globally.
* The `per_url_limit` parameter limits the number of pages returned for each specific URL, if more than one URL is provided.

### Crawling a Single URL

The following example retrieves data from a single webpage:

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url = 'https://docs.mindsdb.com/';
```

Returns **1 row** by default.

To retrieve more pages from the same URL, specify the `LIMIT`:

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url = 'https://docs.mindsdb.com/'
LIMIT 30;
```

Returns up to **30 rows**.

### Crawling Multiple URLs

To crawl multiple URLs at once:

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url IN ('https://docs.mindsdb.com/', 'https://dev.mysql.com/doc/', 'https://mindsdb.com/');
```

Returns **3 rows** by default (1 row per URL).

To apply a per-URL limit:

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url IN ('https://docs.mindsdb.com/', 'https://dev.mysql.com/doc/')
AND per_url_limit = 2;
```

Returns **4 rows** (2 rows per URL).

### Crawling with Depth

To crawl all pages linked within a website:

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url = 'https://docs.mindsdb.com/'
AND crawl_depth = 1;
```

Returns **1 + x rows**, where `x` is the number of linked webpages.

For multiple URLs with crawl depth:

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url IN ('https://docs.mindsdb.com/', 'https://dev.mysql.com/doc/')
AND crawl_depth = 1;
```

Returns **2 + x + y rows**, where `x` and `y` are the number of linked pages from each URL.

### Get PDF Content

MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql theme={null}
SELECT * 
FROM my_web.crawler 
WHERE url = '<link-to-pdf-file>' 
LIMIT 1;
```

### Configuring Web Handler for Specific Domains

The Web Handler can be configured to interact only with specific domains by using the `web_crawling_allowed_sites` setting in the `config.json` file.
This feature allows you to restrict the handler to crawl and process content only from the domains you specify, enhancing security and control over web interactions.

To configure this, simply list the allowed domains under the `web_crawling_allowed_sites` key in `config.json`. For example:

```json theme={null}
"web_crawling_allowed_sites": [
    "https://docs.mindsdb.com",
    "https://another-allowed-site.com"
]
```

## Troubleshooting

<Warning>
  `Web crawler encounters character encoding issues`

  * **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
  * **Checklist**:
    1. Open a GitHub Issue: If you encounter a bug or a repeatable error with encoding,
       report it on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository by opening an issue.
</Warning>

<Warning>
  `Web crawler times out while trying to fetch content`

  * **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
  * **Checklist**:
    1. Check the network connection to ensure the target site is reachable.
</Warning>
