from datetime import timedelta

from raghilda.crawl import CrawlScope, WebCrawler

# Cache under .raghilda/cache/web and treat bodies older than one day as stale.
crawler = WebCrawler(cache_dir=True, cache_stale_after=timedelta(days=1))

# Crawl HTML pages under the guide, following links up to two levels deep.
scope = CrawlScope(
    roots=["https://quarto.org/docs/guide/"],
    depth=2,
    include_patterns=[r"^https://quarto\.org/docs/guide/"],
    include_types=["html"],
)

for document in crawler.markdown_documents(scope):
    print(document.origin)

crawl.WebCrawler
Crawl a website by fetching pages directly over HTTP.
Usage
crawl.WebCrawler()

A WebCrawler starts from one or more root URLs, fetches each page with requests, follows discovered links up to scope.depth, and yields matching pages as MarkdownDocument objects. Link following is constrained by the scope’s patterns and types, and by the include_external_links / include_subdomains flags.
Fetched response bodies are cached on disk. When cache_stale_after is set, fresh cached responses are reused and stale ones are revalidated with ETag / Last-Modified headers when the server provides them. Pass cache_force_refresh=True to any method to bypass the cache for a run.
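The freshness rule above can be sketched in a few lines. This is a minimal illustration, not raghilda's implementation: the `is_fresh` helper and its `fetched_at` argument are hypothetical names, and treating an entry whose age exactly equals the limit as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(fetched_at, stale_after, now=None):
    """Return True if a cached body may be reused without revalidation.

    Hypothetical helper: `fetched_at` is when the body was stored and
    `stale_after` plays the role of cache_stale_after.
    """
    if stale_after is None:
        # Documented default: cached bodies are always considered fresh.
        return True
    now = now or datetime.now(timezone.utc)
    # Assumption: an entry exactly `stale_after` old already counts as stale.
    return now - fetched_at < stale_after


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
stored = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)  # 18 hours earlier

print(is_fresh(stored, timedelta(days=1), now=now))   # younger than one day
print(is_fresh(stored, timedelta(hours=6), now=now))  # older than six hours
print(is_fresh(stored, None, now=now))                # no staleness limit
```

A stale entry is not discarded here; it only becomes a candidate for revalidation.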
Parameters
session: requests.Session | None = None
    A requests.Session to use for requests. When omitted, a new session is created. A caller-supplied session also scopes the cache so entries are not shared across sessions.

cache_dir: bool | str | Path | None = None
    Where to cache fetched bodies. None (default) uses a temporary directory. True uses .raghilda/cache/web under the current working directory. A string or Path uses that location.

cache_stale_after: timedelta | None = None
    How long a cached body stays fresh before it must be revalidated. When None (default), cached bodies are always considered fresh.

max_workers: int = 1
    Number of worker threads used to fetch pages concurrently. Must be at least 1. Default is 1.
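The three cache_dir cases can be made concrete with a small sketch. The `resolve_cache_dir` helper and the `webcache-` prefix are hypothetical and only mirror the documented mapping; how the library itself resolves the directory is not shown in this reference, and the handling of False is an assumption.

```python
import tempfile
from pathlib import Path


def resolve_cache_dir(cache_dir=None):
    """Map a cache_dir argument to a concrete directory path (hypothetical helper)."""
    if cache_dir is None:
        # Default: a freshly created temporary directory.
        return Path(tempfile.mkdtemp(prefix="webcache-"))
    if cache_dir is True:
        # True: a fixed location under the current working directory.
        return Path.cwd() / ".raghilda" / "cache" / "web"
    if cache_dir is False:
        # Assumption: False is not documented, so this sketch rejects it.
        raise ValueError("cache_dir=False is not supported in this sketch")
    # A string or Path: use that location as given.
    return Path(cache_dir)


print(resolve_cache_dir(True))
print(resolve_cache_dir("build/web-cache"))
```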
Examples
Methods
| Name | Description |
|---|---|
| fetch_raw() | Fetch one URL over HTTP and return its source metadata. |
| origins() | Discover web origins reachable from the scope’s roots. |
fetch_raw()
Fetch one URL over HTTP and return its source metadata.
Usage
fetch_raw(origin, *, cache_force_refresh=False)

A fresh cached body is returned without a network request. A stale cached body is revalidated with If-None-Match / If-Modified-Since headers; on a 304 Not Modified response the cached body is reused. Otherwise the body is downloaded and cached, and its content type and type label are recorded on the returned source.
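The revalidation step can be illustrated without touching the network. In the sketch below, `CachedEntry`, `Response`, `conditional_headers`, and `revalidate` are hypothetical names, and `send` is a stand-in for the HTTP request; only the header names and the 304 behavior come from the description above. Error responses are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedEntry:
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


@dataclass
class Response:
    status: int
    body: bytes = b""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry):
    """Build request validators from whatever the server supplied earlier."""
    headers = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def revalidate(entry, send):
    """Revalidate a stale entry; `send` maps request headers to a Response."""
    response = send(conditional_headers(entry))
    if response.status == 304:
        # Not Modified: keep using the cached body.
        return entry
    # Otherwise treat the response as a new download and replace the entry.
    return CachedEntry(response.body, response.etag, response.last_modified)


stale = CachedEntry(b"<html>old</html>", etag='"abc"')


def unchanged(headers):
    # Fake server whose resource still has the ETag "abc".
    if headers.get("If-None-Match") == '"abc"':
        return Response(304)
    return Response(200, b"<html>old</html>", etag='"abc"')


def changed(headers):
    # Fake server whose resource has changed since it was cached.
    return Response(200, b"<html>new</html>", etag='"def"')


print(revalidate(stale, unchanged).body)
print(revalidate(stale, changed).body)
```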
Parameters
origin: str
    The URL to fetch. It is canonicalized before use and must be an http or https URL.

cache_force_refresh: bool = False
    When True, ignore any cached body and re-fetch from the server.
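The exact canonicalization rules are not spelled out in this reference. As a rough sketch under assumed rules (lowercase the scheme and host, default an empty path to `/`, drop the fragment), a hypothetical `canonicalize` helper built on the standard library might look like this:

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(origin):
    """Normalize a URL and reject non-HTTP schemes (assumed rules, not raghilda's)."""
    parts = urlsplit(origin.strip())
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"expected an http or https URL, got {origin!r}")
    # Host names are case-insensitive; this also lowercases any userinfo,
    # which is acceptable for a sketch.
    host = parts.netloc.lower()
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


print(canonicalize("HTTPS://Quarto.org/docs/guide/#intro"))
print(canonicalize("https://quarto.org"))
```

Under these assumptions, URLs that differ only in letter case of the host or in their fragment collapse to one cache key.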
Returns
FetchedSource
    The fetched source, with its cached body path and metadata.
origins()
Discover web origins reachable from the scope’s roots.
Usage
origins(scope, *, progress=True, cache_force_refresh=False)

Performs a breadth-first crawl: each root is fetched, its links are extracted and canonicalized, and the frontier expands one level at a time until scope.depth or scope.limit is reached. Origins outside the root scope are dropped unless include_external_links or include_subdomains allows them. Only origins that pass the scope’s pattern and type filters are yielded.
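The breadth-first traversal can be sketched over an in-memory link graph. Everything here is illustrative: `discover` is a hypothetical function, `links` stands in for fetching a page and extracting its links, and the reading of depth (roots are level 0) and limit (maximum number of URLs yielded) is an assumption. Pattern, type, and host filtering are left out.

```python
from collections import deque


def discover(roots, links, depth, limit=None):
    """Yield unique URLs breadth-first, at most `depth` hops from a root.

    `links` is a dict mapping each URL to the URLs it links to.
    """
    roots = list(dict.fromkeys(roots))  # drop duplicate roots, keep order
    seen = set(roots)
    queue = deque((root, 0) for root in roots)
    yielded = 0
    while queue:
        url, level = queue.popleft()
        yield url
        yielded += 1
        if limit is not None and yielded >= limit:
            return
        if level >= depth:
            continue  # do not follow links beyond the depth bound
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))


site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}

for url in discover(["https://example.com/"], site, depth=2):
    print(url)
```

Because `discover` is a generator, pages are only "fetched" as the caller iterates, which matches the lazy iterator that origins() is documented to return.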
Parameters
scope: CrawlScope
    The CrawlScope describing what to crawl. roots must be http or https URLs.

progress: bool = True
    Unused; accepted for interface compatibility.

cache_force_refresh: bool = False
    When True, re-fetch pages instead of using cached bodies while discovering links.
Returns
Iterator[str]
    A lazy iterator of unique canonical URLs, in crawl order.