crawl.BaseCrawler

Abstract base class for crawlers.

Usage

crawl.BaseCrawler()

A crawler discovers source documents for a CrawlScope and converts them into MarkdownDocument objects ready for chunking and ingestion. All crawlers expose the same four public methods, so a scope and the surrounding workflow can be reused across backends.

Subclasses provide a concrete discovery and fetching strategy:

DirectoryCrawler: walk local files and directories.
WebCrawler: fetch pages directly over HTTP and follow links.
CloudflareCrawler: delegate discovery and rendering to Cloudflare’s Browser Rendering API.
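To make the shared interface concrete, here is a minimal sketch of a toy crawler that exposes the same four methods over an in-memory dictionary. Everything in it is an assumption for illustration: `InMemoryCrawler` is hypothetical, the `FetchedSource` and `MarkdownDocument` classes are simplified stand-ins (the real `FetchedSource` carries a body path rather than an inline body), the "scope" is a plain prefix string, and the real library may wire the methods together differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional


# Hypothetical stand-ins for the library's types; real fields may differ.
@dataclass
class FetchedSource:
    origin: str
    body: str
    metadata: dict = field(default_factory=dict)


@dataclass
class MarkdownDocument:
    origin: str
    text: str


def default_convert(source: FetchedSource) -> MarkdownDocument:
    # Toy default conversion: treat the body as Markdown already.
    return MarkdownDocument(origin=source.origin, text=source.body)


class InMemoryCrawler:
    """Toy crawler with the four public methods, backed by a dict of pages."""

    def __init__(self, pages: dict):
        self.pages = pages

    def origins(self, scope, *, progress=True, cache_force_refresh=False) -> Iterator[str]:
        # A "scope" here is only a prefix string; the real CrawlScope is richer.
        for origin in self.pages:
            if origin.startswith(scope):
                yield origin

    def fetch_raw(self, origin, *, cache_force_refresh=False) -> FetchedSource:
        return FetchedSource(origin=origin, body=self.pages[origin])

    def fetch_markdown(
        self,
        origin,
        *,
        convert: Optional[Callable[[FetchedSource], MarkdownDocument]] = None,
        cache_force_refresh=False,
    ) -> MarkdownDocument:
        source = self.fetch_raw(origin, cache_force_refresh=cache_force_refresh)
        return (convert or default_convert)(source)

    def markdown_documents(
        self, scope, *, convert=None, progress=True, cache_force_refresh=False
    ) -> Iterator[MarkdownDocument]:
        # Sequential for brevity; no worker threads in this toy version.
        for origin in self.origins(scope, cache_force_refresh=cache_force_refresh):
            yield self.fetch_markdown(
                origin, convert=convert, cache_force_refresh=cache_force_refresh
            )


crawler = InMemoryCrawler({"docs/a.md": "# A", "docs/b.md": "# B", "blog/c.md": "# C"})
docs = list(crawler.markdown_documents("docs/"))
```

Because the calling code only touches the four methods, swapping this toy class for another backend would leave the surrounding workflow unchanged.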

Attributes

max_workers: int: Number of worker threads used to fetch and convert sources concurrently in markdown_documents().

Methods

Name	Description
fetch_markdown()	Fetch one origin and convert it to a Markdown document.
fetch_raw()	Fetch the raw body and metadata for one origin.
markdown_documents()	Discover and convert all sources matched by a scope.
origins()	Discover the origins matched by a scope.

fetch_markdown()

Fetch one origin and convert it to a Markdown document.

Usage


fetch_markdown(origin, *, convert=None, cache_force_refresh=False)

Parameters

origin: str: The origin to fetch and convert, as produced by origins().
convert: Callable[[FetchedSource], MarkdownDocument] | None = None: Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the crawler’s default conversion is used. Use this to apply custom cleanup; keep chunking in store.ingest(prepare=...) rather than in the converter.
cache_force_refresh: bool = False: When True, bypass any cached body and re-fetch from the source.

Returns

MarkdownDocument: The converted document for origin.
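A custom converter of the kind described above might look like the following sketch. It only does cleanup (dropping an "Edit this page" line and trailing whitespace) and leaves chunking alone. The `FetchedSource` and `MarkdownDocument` classes are simplified, hypothetical stand-ins so the example runs on its own, and `strip_edit_links` is an illustrative name, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class FetchedSource:  # hypothetical stand-in; the real type holds a body path
    origin: str
    body: str


@dataclass
class MarkdownDocument:  # hypothetical stand-in
    origin: str
    text: str


def strip_edit_links(source: FetchedSource) -> MarkdownDocument:
    """Remove 'Edit this page' lines and trailing whitespace; no chunking."""
    kept = [
        line.rstrip()
        for line in source.body.splitlines()
        if "Edit this page" not in line
    ]
    text = "\n".join(kept).strip() + "\n"
    return MarkdownDocument(origin=source.origin, text=text)


doc = strip_edit_links(
    FetchedSource("docs/a.md", "# Title  \n[Edit this page](x)\nBody text\n")
)
```

With a real crawler, a function like this would be passed as `fetch_markdown(origin, convert=strip_edit_links)`.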

fetch_raw()

Fetch the raw body and metadata for one origin.

Usage


fetch_raw(origin, *, cache_force_refresh=False)

Parameters

origin: str: The origin to fetch, as produced by origins().
cache_force_refresh: bool = False: When True, bypass any cached body and re-fetch from the source.

Returns

FetchedSource: The fetched source, with its body path and metadata. The raw body is not yet converted to Markdown.
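The effect of `cache_force_refresh` can be illustrated with a small in-memory sketch. This is not the library's cache: `CachedFetcher` is hypothetical, it returns the body as a string instead of a `FetchedSource`, and it assumes a forced refresh also overwrites the cached entry.

```python
from typing import Callable


class CachedFetcher:
    """Toy fetcher that caches bodies by origin (hypothetical, in-memory)."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._cache: dict = {}
        self.source_calls = 0  # how many times the source was actually hit

    def fetch_raw(self, origin: str, *, cache_force_refresh: bool = False) -> str:
        if cache_force_refresh or origin not in self._cache:
            self.source_calls += 1
            self._cache[origin] = self._fetch(origin)
        return self._cache[origin]


fetcher = CachedFetcher(lambda origin: f"body of {origin}")
fetcher.fetch_raw("a")                            # cache miss: fetches
fetcher.fetch_raw("a")                            # cache hit: no fetch
fetcher.fetch_raw("a", cache_force_refresh=True)  # bypasses the cache
```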

markdown_documents()

Discover and convert all sources matched by a scope.

Usage


markdown_documents(
    scope, *, convert=None, progress=True, cache_force_refresh=False
)

This is the primary entry point for crawling. It combines origins() and fetch_markdown(), fetching and converting sources concurrently using up to max_workers threads while preserving discovery order. The result is intended to be passed directly to store.ingest().

Parameters

scope: CrawlScope: The scope describing what to crawl.
convert: Callable[[FetchedSource], MarkdownDocument] | None = None: Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the crawler’s default conversion is used.
progress: bool = True: Whether to display crawl progress, when the backend supports it.
cache_force_refresh: bool = False: When True, bypass both cached discovery results and cached bodies, and re-crawl.

Returns

Iterator[MarkdownDocument]: A lazy iterator of converted documents, in discovery order.
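One way to fetch concurrently while still yielding results in discovery order is `ThreadPoolExecutor.map`, which returns results in input order regardless of which task finishes first. The sketch below shows that technique in isolation; it is an assumption about how such behavior can be achieved, not a description of the library's internals, and `convert_in_order` and `slow_fetch` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def convert_in_order(
    origins: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 4,
) -> Iterator[str]:
    """Run fetch concurrently and yield results in the order of origins."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even when later
        # tasks complete before earlier ones.
        yield from pool.map(fetch, origins)


def slow_fetch(origin: str) -> str:
    # Earlier origins sleep longer, so they finish last.
    time.sleep({"a": 0.03, "b": 0.02, "c": 0.01}[origin])
    return origin.upper()


results = list(convert_in_order(["a", "b", "c"], slow_fetch, max_workers=3))
```

Note that `Executor.map` consumes its input iterable up front, so this sketch is lazy only in how results are yielded, not in how origins are discovered.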

origins()

Discover the origins matched by a scope.

Usage


origins(scope, *, progress=True, cache_force_refresh=False)

Parameters

scope: CrawlScope: The scope describing what to crawl.
progress: bool = True: Whether to display crawl progress, when the backend supports it.
cache_force_refresh: bool = False: When True, bypass any cached discovery results and re-crawl.

Returns

Iterator[str]: A lazy iterator of source origins, in discovery order. Each origin is unique within a single call.
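The return contract (lazy, in discovery order, unique within a call) can be sketched as a generator that remembers what it has already yielded. `unique_in_order` is a hypothetical helper written for illustration; how a real backend deduplicates is not stated here.

```python
from typing import Iterable, Iterator


def unique_in_order(candidates: Iterable[str]) -> Iterator[str]:
    """Lazily yield each origin once, keeping first-seen order."""
    seen = set()
    for origin in candidates:
        if origin not in seen:
            seen.add(origin)
            yield origin


found = list(unique_in_order(["/a", "/b", "/a", "/c", "/b"]))
```

Because it is a generator, nothing is consumed from `candidates` until the caller starts iterating.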