my_web. This database by default has a table called `crawler` that stores data crawled from a given URL or multiple URLs.

The `crawl_depth` parameter defines how deep the crawler should navigate through linked pages:
- `crawl_depth = 0`: crawls only the specified page.
- `crawl_depth = 1`: crawls the specified page and all pages linked on it.

The `LIMIT` clause defines the maximum number of pages returned globally. The `per_url_limit` parameter limits the number of pages returned for each specific URL, if more than one URL is provided.

For one URL with crawl depth, the maximum value of `LIMIT` is `1 + x`, where `x` is the number of linked webpages.
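A single-URL crawl under these limits might look like the following sketch (the URL and the `LIMIT` value are illustrative; parameters are passed through the `WHERE` clause, as is usual for this handler):

```sql
-- Crawl the given page plus the pages it links to (crawl_depth = 1).
-- At most 1 + x rows can come back, where x is the number of linked
-- pages; LIMIT 10 caps the result at 10 rows regardless.
SELECT *
FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/'
AND crawl_depth = 1
LIMIT 10;
```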
For multiple URLs with crawl depth, the maximum value of `LIMIT` is `2 + x + y` (in the two-URL case), where `x` and `y` are the numbers of linked pages on each URL.
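When more than one URL is provided, `per_url_limit` caps the rows returned per URL. A sketch, with illustrative URLs and values:

```sql
-- Crawl two URLs one level deep; per_url_limit = 5 returns at most
-- 5 pages for each URL, so at most 10 rows in total here.
SELECT *
FROM my_web.crawler
WHERE url IN ('https://docs.mindsdb.com/', 'https://mindsdb.com/')
AND crawl_depth = 1
AND per_url_limit = 5;
```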
`csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.
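Fetching a PDF works through the same `crawler` table; the query below is a sketch with an illustrative URL:

```sql
-- Fetch and extract the text content of a PDF reachable via URL.
SELECT *
FROM my_web.crawler
WHERE url = 'https://example.com/report.pdf'
LIMIT 1;
```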
`web_crawling_allowed_sites` setting in the `config.json` file. This feature allows you to restrict the handler to crawl and process content only from the domains you specify, enhancing security and control over web interactions. To configure this, simply list the allowed domains under the `web_crawling_allowed_sites` key in `config.json`. For example:
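A minimal `config.json` fragment might look like this (the domains listed are illustrative placeholders):

```json
{
    "web_crawling_allowed_sites": [
        "https://docs.mindsdb.com",
        "https://another-allowed-site.com"
    ]
}
```

With this in place, requests to any domain not on the list are rejected by the handler.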
- Web crawler encounters character encoding issues
- Web crawler times out while trying to fetch content