crawl.WebCrawler

Crawl a website by fetching pages directly over HTTP.

Usage

crawl.WebCrawler()

A WebCrawler starts from one or more root URLs, fetches each page with requests, follows discovered links up to scope.depth, and yields matching pages as MarkdownDocument objects. Link following is constrained by the scope’s patterns, types, and the include_external_links / include_subdomains flags.

Fetched response bodies are cached on disk. When cache_stale_after is set, fresh cached responses are reused and stale ones are revalidated with ETag / Last-Modified headers when the server provides them. Pass cache_force_refresh=True to any method to bypass the cache for a run.
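The freshness rule described above can be sketched as a small pure function. This is only an illustration, not the crawler's actual code: the helper name is_fresh, the cached_at timestamp, and the strict "less than" boundary are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(cached_at, stale_after, now=None):
    """Return True when a cached body can be reused without revalidation.

    Hypothetical helper: `cached_at` is when the body was stored and
    `stale_after` plays the role of cache_stale_after.
    """
    if stale_after is None:
        # No staleness window configured: cached bodies are always fresh.
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumption: an entry exactly `stale_after` old already counts as stale.
    return now - cached_at < stale_after


cached_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
one_day = timedelta(days=1)

print(is_fresh(cached_at, one_day, now=cached_at + timedelta(hours=6)))  # True
print(is_fresh(cached_at, one_day, now=cached_at + timedelta(days=2)))   # False
print(is_fresh(cached_at, None, now=cached_at + timedelta(days=365)))    # True
```

A stale result here would correspond to the revalidation path rather than an unconditional re-download.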

Parameters

session: requests.Session | None = None

A requests.Session to use for requests. When omitted, a new session is created. A caller-supplied session also scopes the cache so entries are not shared across sessions.

cache_dir: bool | str | Path | None = None

Where to cache fetched bodies. None (default) uses a temporary directory. True uses .raghilda/cache/web under the current working directory. A string or Path uses that location.

cache_stale_after: timedelta | None = None

How long a cached body stays fresh before it must be revalidated. When None (default), cached bodies are always considered fresh.

max_workers: int = 1

Number of worker threads used to fetch pages concurrently. Must be at least 1 (the default).
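As a rough illustration of what a bounded pool of fetch workers looks like, the sketch below uses the standard library's ThreadPoolExecutor. Whether WebCrawler is built this way is not stated here; fetch and fetch_all are hypothetical stand-ins, and no network request is made.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for an HTTP fetch; returns a fake body.
    return f"<body of {url}>"


def fetch_all(urls, max_workers=1):
    """Fetch every URL using at most `max_workers` threads."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, whichever thread finishes first.
        return list(pool.map(fetch, urls))


urls = ["https://example.com/a", "https://example.com/b"]
print(fetch_all(urls, max_workers=2))
```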

Examples

from datetime import timedelta

from raghilda.crawl import CrawlScope, WebCrawler

crawler = WebCrawler(cache_dir=True, cache_stale_after=timedelta(days=1))
scope = CrawlScope(
    roots=["https://quarto.org/docs/guide/"],
    depth=2,
    include_patterns=[r"^https://quarto\.org/docs/guide/"],
    include_types=["html"],
)

for document in crawler.markdown_documents(scope):
    print(document.origin)

Methods

Name          Description
fetch_raw()   Fetch one URL over HTTP and return its source metadata.
origins()     Discover web origins reachable from the scope’s roots.

fetch_raw()

Fetch one URL over HTTP and return its source metadata.

Usage

fetch_raw(origin, *, cache_force_refresh=False)

A fresh cached body is returned without a network request. A stale cached body is revalidated with If-None-Match / If-Modified-Since headers; on a 304 Not Modified the cached body is reused. Otherwise the body is downloaded, cached, and its content type and type label are recorded on the returned source.
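The revalidation step follows the standard HTTP conditional-request pattern, which can be sketched without touching the network. The helper names conditional_headers and body_to_use are hypothetical, and the validator values are made up for the example; only the header names and the 304 status code come from the description above.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build revalidation headers from validators saved with a cached body."""
    headers = {}
    if etag:
        # Ask the server to skip the body if the entity tag still matches.
        headers["If-None-Match"] = etag
    if last_modified:
        # Ask the server to skip the body if it has not changed since then.
        headers["If-Modified-Since"] = last_modified
    return headers


def body_to_use(status_code, cached_body, response_body):
    """Pick the body after revalidation: 304 means the cached copy is still valid."""
    if status_code == 304:
        return cached_body
    return response_body


headers = conditional_headers(
    etag='"abc123"',
    last_modified="Wed, 21 Oct 2015 07:28:00 GMT",
)
print(headers)
print(body_to_use(304, b"cached", b""))
print(body_to_use(200, b"cached", b"downloaded"))
```

When the server supplied neither validator, the header dictionary is empty and the request is an ordinary unconditional fetch.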

Parameters

origin: str

The URL to fetch. It is canonicalized before use and must be an http or https URL.

cache_force_refresh: bool = False

When True, ignore any cached body and re-fetch from the server.

Returns

FetchedSource

The fetched source, with its cached body path and metadata.

origins()

Discover web origins reachable from the scope’s roots.

Usage

origins(scope, *, progress=True, cache_force_refresh=False)

Performs a breadth-first crawl: each root is fetched, its links are extracted and canonicalized, and the frontier expands one level at a time until scope.depth or scope.limit is reached. Origins outside the root scope are dropped unless include_external_links or include_subdomains allows them. Only origins that pass the scope’s pattern and type filters are yielded.
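A minimal sketch of a depth-limited breadth-first crawl is shown below, run over an in-memory link graph instead of real HTTP fetches. It is not the library's implementation: the discover function, the LINKS mapping, and the example.com URLs are hypothetical, and it assumes that roots sit at level 0 and that the limit counts yielded origins. External-link, subdomain, and type filtering are left out.

```python
import re
from collections import deque

# Hypothetical link graph standing in for fetched pages and their extracted links.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1", "https://example.com/"],
    "https://example.com/b": ["https://example.com/a/1"],
    "https://example.com/a/1": ["https://example.com/a/1/deep"],
}


def discover(roots, links, depth, limit=None, include_patterns=()):
    """Yield unique URLs in breadth-first order, up to `depth` levels from the roots."""
    seen = set()
    frontier = deque()
    for root in roots:
        if root not in seen:
            seen.add(root)
            frontier.append((root, 0))
    yielded = 0
    while frontier:
        url, level = frontier.popleft()
        # Only URLs matching a pattern are yielded; others are still expanded.
        if not include_patterns or any(re.search(p, url) for p in include_patterns):
            yield url
            yielded += 1
            if limit is not None and yielded >= limit:
                return
        if level >= depth:
            continue  # do not follow links beyond the depth limit
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, level + 1))


print(list(discover(["https://example.com/"], LINKS, depth=1)))
print(list(discover(["https://example.com/"], LINKS, depth=2, limit=2)))
```

Because the function is a generator, callers can stop consuming early, which matches the lazy iterator the method is documented to return.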

Parameters

scope: CrawlScope

The CrawlScope describing what to crawl. roots must be http or https URLs.

progress: bool = True

Unused; accepted for interface compatibility.

cache_force_refresh: bool = False

When True, re-fetch pages instead of using cached bodies while discovering links.

Returns

Iterator[str]

A lazy iterator of unique canonical URLs, in crawl order.