crawl.BaseCrawler

Abstract base class for crawlers.

Usage

crawl.BaseCrawler()

A crawler discovers source documents for a CrawlScope and converts them into MarkdownDocument objects ready for chunking and ingestion. All crawlers expose the same four public methods, so a scope and the surrounding workflow can be reused across backends.

Subclasses provide a concrete discovery and fetching strategy.

Attributes

max_workers: int
Number of worker threads used to fetch and convert sources concurrently in markdown_documents().

Methods

Name                  Description
fetch_markdown()      Fetch one origin and convert it to a Markdown document.
fetch_raw()           Fetch the raw body and metadata for one origin.
markdown_documents()  Discover and convert all sources matched by a scope.
origins()             Discover the origins matched by a scope.
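
To make the relationship between the four methods concrete, here is a minimal sketch of a toy crawler over an in-memory dict. It is not the library's implementation: DictCrawler, the prefix field on the scope, and the body and text fields are hypothetical stand-ins, and the sketch skips caching, progress display, and threading.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


# Hypothetical stand-ins for the library's types; the real classes differ.
@dataclass
class CrawlScope:
    prefix: str  # assumed field, for illustration only


@dataclass
class FetchedSource:
    origin: str
    body: str  # simplification: the real FetchedSource carries a body path and metadata


@dataclass
class MarkdownDocument:
    origin: str
    text: str


class DictCrawler:
    """Toy crawler that treats the keys of a dict as origins."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def origins(self, scope, *, progress=True, cache_force_refresh=False) -> Iterator[str]:
        # Discovery: yield matching origins lazily, in dict insertion order.
        for origin in self.pages:
            if origin.startswith(scope.prefix):
                yield origin

    def fetch_raw(self, origin, *, cache_force_refresh=False) -> FetchedSource:
        return FetchedSource(origin=origin, body=self.pages[origin])

    def fetch_markdown(self, origin, *, convert=None, cache_force_refresh=False):
        source = self.fetch_raw(origin, cache_force_refresh=cache_force_refresh)
        if convert is not None:
            return convert(source)
        # Default conversion in this toy: the body is already Markdown.
        return MarkdownDocument(origin=source.origin, text=source.body.strip())

    def markdown_documents(self, scope, *, convert=None, progress=True,
                           cache_force_refresh=False):
        # Sequential here for brevity; no worker threads are used in this toy.
        for origin in self.origins(scope, progress=progress,
                                   cache_force_refresh=cache_force_refresh):
            yield self.fetch_markdown(origin, convert=convert,
                                      cache_force_refresh=cache_force_refresh)


crawler = DictCrawler({"docs/a": " # A ", "docs/b": "# B", "blog/c": "# C"})
docs = list(crawler.markdown_documents(CrawlScope(prefix="docs/")))
```

Because every method takes the same scope or origin arguments, swapping DictCrawler for another backend would leave the calling code unchanged.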

fetch_markdown()

Fetch one origin and convert it to a Markdown document.

Usage

fetch_markdown(origin, *, convert=None, cache_force_refresh=False)

Parameters
origin: str

The origin to fetch and convert.

convert: Callable[[FetchedSource], MarkdownDocument] | None = None

Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the crawler’s default conversion is used. Use this to apply custom cleanup; keep chunking in store.ingest(prepare=...) rather than in the converter.

cache_force_refresh: bool = False

When True, bypass any cached body and re-fetch from the source.

Returns
MarkdownDocument

The converted document for origin.
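
As an illustration of the convert parameter, the sketch below shows a converter that only performs cleanup and leaves chunking to ingestion. The FetchedSource and MarkdownDocument classes here are simplified, hypothetical stand-ins (the real ones are provided by the library and have different fields), and drop_nav_lines is an invented name.

```python
import re
from dataclasses import dataclass


@dataclass
class FetchedSource:  # hypothetical stand-in
    origin: str
    body: str  # simplification: the real object refers to a body path


@dataclass
class MarkdownDocument:  # hypothetical stand-in
    origin: str
    text: str


# Lines that are navigation residue rather than content (assumed examples).
NAV_LINE = re.compile(r"^\s*(Skip to content|Edit this page)\s*$")


def drop_nav_lines(source: FetchedSource) -> MarkdownDocument:
    """Cleanup only: remove navigation residue, do not split into chunks."""
    kept = [line for line in source.body.splitlines() if not NAV_LINE.match(line)]
    return MarkdownDocument(origin=source.origin, text="\n".join(kept).strip())


doc = drop_nav_lines(
    FetchedSource(
        origin="https://example.com/a",
        body="Skip to content\n# Title\n\nBody text.\nEdit this page\n",
    )
)
```

With a real crawler, such a callable would be passed as fetch_markdown(origin, convert=drop_nav_lines).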

fetch_raw()

Fetch the raw body and metadata for one origin.

Usage

fetch_raw(origin, *, cache_force_refresh=False)

Parameters
origin: str

The origin to fetch, as produced by origins().

cache_force_refresh: bool = False

When True, bypass any cached body and re-fetch from the source.

Returns
FetchedSource

The fetched source, with its body path and metadata. The raw body is not yet converted to Markdown.
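
The effect of cache_force_refresh can be pictured with a small sketch. How the library actually caches bodies is not described here, so the dict-based cache, the CachingFetcher class, and the fake_download function below are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


class CachingFetcher:
    """Toy cache keyed by origin; `download` is a hypothetical fetch function."""

    def __init__(self, download: Callable[[str], str]) -> None:
        self._download = download
        self._cache: Dict[str, str] = {}

    def fetch_raw(self, origin: str, *, cache_force_refresh: bool = False) -> str:
        # Re-download when asked to, or when nothing is cached yet.
        if cache_force_refresh or origin not in self._cache:
            self._cache[origin] = self._download(origin)
        return self._cache[origin]


calls: List[str] = []


def fake_download(origin: str) -> str:
    # Records each call so cache hits and misses are observable.
    calls.append(origin)
    return f"<html>{origin} v{len(calls)}</html>"


fetcher = CachingFetcher(fake_download)
first = fetcher.fetch_raw("https://example.com/a")
second = fetcher.fetch_raw("https://example.com/a")  # served from the cache
third = fetcher.fetch_raw("https://example.com/a", cache_force_refresh=True)
```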

markdown_documents()

Discover and convert all sources matched by a scope.

Usage

markdown_documents(
    scope, *, convert=None, progress=True, cache_force_refresh=False
)

This is the primary entry point for crawling. It combines origins() and fetch_markdown(), fetching and converting sources concurrently using up to max_workers threads while preserving discovery order. The result is intended to be passed directly to store.ingest().

Parameters
scope: CrawlScope

The CrawlScope describing what to crawl.

convert: Callable[[FetchedSource], MarkdownDocument] | None = None

Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the crawler’s default conversion is used.

progress: bool = True

Whether to display crawl progress, when the backend supports it.

cache_force_refresh: bool = False

When True, bypass cached discovery results and cached bodies, and re-crawl.

Returns
Iterator[MarkdownDocument]

A lazy iterator of converted documents, in discovery order.
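
One common way to fetch concurrently while keeping discovery order is a thread pool whose map call yields results in input order. The sketch below uses the standard library's ThreadPoolExecutor for this. It is an assumption about how such behavior could be built, not a description of the library's internals, and convert_in_order and slow_fetch are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def convert_in_order(
    origins: Iterable[str],
    fetch_one: Callable[[str], T],
    max_workers: int = 4,
) -> Iterator[T]:
    """Run fetch_one on worker threads, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # regardless of which call finishes first.
        yield from pool.map(fetch_one, origins)


def slow_fetch(origin: str) -> str:
    # Earlier origins sleep longer, so they finish last.
    time.sleep(0.01 * (3 - int(origin[-1])))
    return origin.upper()


results = list(convert_in_order(["p0", "p1", "p2"], slow_fetch, max_workers=3))
```

Note that Executor.map submits every input up front, so this sketch is lazy only on the output side.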

origins()

Discover the origins matched by a scope.

Usage

origins(scope, *, progress=True, cache_force_refresh=False)

Parameters
scope: CrawlScope

The CrawlScope describing what to crawl.

progress: bool = True

Whether to display crawl progress, when the backend supports it.

cache_force_refresh: bool = False

When True, bypass any cached discovery results and re-crawl.

Returns
Iterator[str]

A lazy iterator of source origins, in discovery order. Each origin is unique within a single call.
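
The return contract (lazy, ordered, unique within a call) can be illustrated with a generator that drops repeats while preserving first-seen order. This is a sketch of the contract only. unique_in_order and the example links are hypothetical, and a real backend may deduplicate differently.

```python
from itertools import islice
from typing import Iterable, Iterator, Set


def unique_in_order(candidates: Iterable[str]) -> Iterator[str]:
    """Yield each origin once, in the order it was first discovered."""
    seen: Set[str] = set()
    for origin in candidates:
        if origin not in seen:
            seen.add(origin)
            yield origin


links = [
    "https://example.com/",
    "https://example.com/a",
    "https://example.com/",  # repeat, skipped
    "https://example.com/b",
    "https://example.com/a",  # repeat, skipped
]
origins = list(unique_in_order(links))

# Laziness: only as many candidates are consumed as the caller asks for.
first_two = list(islice(unique_in_order(links), 2))
```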