crawl.BaseCrawler
Abstract base class for crawlers.
Usage
crawl.BaseCrawler()
A crawler discovers source documents for a CrawlScope and converts them into MarkdownDocument objects ready for chunking and ingestion. All crawlers expose the same four public methods, so a scope and the surrounding workflow can be reused across backends.
Subclasses provide a concrete discovery and fetching strategy:
- DirectoryCrawler: walk local files and directories.
- WebCrawler: fetch pages directly over HTTP and follow links.
- CloudflareCrawler: delegate discovery and rendering to Cloudflare’s Browser Rendering API.
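To make the shared interface concrete, here is a minimal sketch of how a four-method base class and a toy subclass could fit together. It is not the library's implementation: `SketchCrawler`, `InMemoryCrawler`, and the use of plain strings and dicts in place of CrawlScope, FetchedSource, and MarkdownDocument are assumptions made for illustration, and caching, progress display, and worker threads are left out.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, Optional


class SketchCrawler(ABC):
    """Hypothetical stand-in for a crawler base class (not the real API)."""

    @abstractmethod
    def origins(self, scope: str) -> Iterator[str]:
        """Yield the origins matched by a scope."""

    @abstractmethod
    def fetch_raw(self, origin: str) -> dict:
        """Return the raw body and metadata for one origin."""

    def fetch_markdown(
        self, origin: str, *, convert: Optional[Callable[[dict], str]] = None
    ) -> str:
        # Default conversion in this sketch: use the raw body unchanged.
        convert = convert or (lambda source: source["body"])
        return convert(self.fetch_raw(origin))

    def markdown_documents(self, scope: str, *, convert=None) -> Iterator[str]:
        # Sequential for brevity; this sketch uses no worker threads.
        for origin in self.origins(scope):
            yield self.fetch_markdown(origin, convert=convert)


class InMemoryCrawler(SketchCrawler):
    """Toy backend: the 'sources' are entries in a dict."""

    def __init__(self, pages: dict):
        self.pages = pages

    def origins(self, scope: str) -> Iterator[str]:
        # Treat the scope as a simple prefix filter (an assumption).
        return (name for name in self.pages if name.startswith(scope))

    def fetch_raw(self, origin: str) -> dict:
        return {"origin": origin, "body": self.pages[origin]}


crawler = InMemoryCrawler(
    {"docs/a.md": "# A", "docs/b.md": "# B", "notes/c.md": "# C"}
)
docs = list(crawler.markdown_documents("docs/"))
```

In this sketch a subclass only supplies discovery (`origins`) and fetching (`fetch_raw`); the two remaining methods are built on top of them, which is one plausible way for a scope and workflow to stay reusable across backends.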
Attributes
max_workers: int- Number of worker threads used to fetch and convert sources concurrently in markdown_documents().
Methods
| Name | Description |
|---|---|
| fetch_markdown() | Fetch one origin and convert it to a Markdown document. |
| fetch_raw() | Fetch the raw body and metadata for one origin. |
| markdown_documents() | Discover and convert all sources matched by a scope. |
| origins() | Discover the origins matched by a scope. |
fetch_markdown()
Fetch one origin and convert it to a Markdown document.
Usage
fetch_markdown(origin, *, convert=None, cache_force_refresh=False)
Parameters
origin: str-
The origin to fetch and convert.
convert: Callable[[FetchedSource], MarkdownDocument] | None = None-
Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the crawler’s default conversion is used. Use this to apply custom cleanup; keep chunking in store.ingest(prepare=...) rather than in the converter.
cache_force_refresh: bool = False-
When True, bypass any cached body and re-fetch from the source.
Returns
MarkdownDocument- The converted document for origin.
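As an illustration of the `convert` parameter, the sketch below shows a cleanup-only converter. The `FetchedSource` and `MarkdownDocument` dataclasses here are hypothetical stand-ins defined locally so the example runs on its own; the field names `origin`, `body_path`, `metadata`, and `content` are assumptions, and the real classes may differ.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FetchedSource:
    """Hypothetical stand-in; the real class's fields may differ."""
    origin: str
    body_path: Path
    metadata: dict = field(default_factory=dict)


@dataclass
class MarkdownDocument:
    """Hypothetical stand-in; the real class's fields may differ."""
    origin: str
    content: str


def strip_trailing_whitespace(source: FetchedSource) -> MarkdownDocument:
    """Cleanup-only converter: no chunking happens here."""
    text = source.body_path.read_text(encoding="utf-8")
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop leading/trailing blank lines and end with a single newline.
    content = "\n".join(lines).strip() + "\n"
    return MarkdownDocument(origin=source.origin, content=content)


# Exercise the converter on a temporary file standing in for a fetched body.
with tempfile.TemporaryDirectory() as tmp:
    body = Path(tmp) / "page.md"
    body.write_text("# Title   \n\nSome text.  \n\n\n", encoding="utf-8")
    doc = strip_trailing_whitespace(FetchedSource(origin="page.md", body_path=body))
```

A callable of this shape would be passed as `convert=` to fetch_markdown(); splitting the document into chunks stays in store.ingest(prepare=...), as the parameter description recommends.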
fetch_raw()
Fetch the raw body and metadata for one origin.
Usage
fetch_raw(origin, *, cache_force_refresh=False)
Parameters
origin: str-
The origin to fetch, as produced by origins().
cache_force_refresh: bool = False-
When True, bypass any cached body and re-fetch from the source.
Returns
FetchedSource- The fetched source, with its body path and metadata. The raw body is not yet converted to Markdown.
markdown_documents()
Discover and convert all sources matched by a scope.
Usage
markdown_documents(
    scope, *, convert=None, progress=True, cache_force_refresh=False
)
This is the primary entry point for crawling. It combines origins() and fetch_markdown(), fetching and converting sources concurrently using up to max_workers threads while preserving discovery order. The result is intended to be passed directly to store.ingest().
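One common way to get concurrent fetching with order-preserving output is `concurrent.futures.ThreadPoolExecutor.map`, sketched below. This illustrates the general technique only; the source does not say how the library implements it, and `convert_in_order` and `slow_fetch` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def convert_in_order(
    origins: Iterable[str], fetch: Callable[[str], str], max_workers: int = 4
) -> Iterator[str]:
    """Run fetch() on each origin concurrently, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # regardless of which worker finishes first.
        yield from pool.map(fetch, origins)


def slow_fetch(origin: str) -> str:
    # Hypothetical fetcher: earlier origins sleep longer, so they finish last.
    delay = {"a": 0.03, "b": 0.02, "c": 0.01}[origin]
    time.sleep(delay)
    return f"# {origin}"


results = list(convert_in_order(["a", "b", "c"], slow_fetch, max_workers=3))
```

Note that `Executor.map` consumes its input iterable up front when it submits the tasks, so in this sketch discovery is not interleaved with fetching; only the results are yielded lazily.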
Parameters
scope: CrawlScope-
The CrawlScope describing what to crawl.
convert: Callable[[FetchedSource], MarkdownDocument] | None = None-
Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the crawler’s default conversion is used.
progress: bool = True-
Whether to display crawl progress, when the backend supports it.
cache_force_refresh: bool = False-
When True, bypass cached discovery and bodies and re-crawl.
Returns
Iterator[MarkdownDocument]- A lazy iterator of converted documents, in discovery order.
origins()
Discover the origins matched by a scope.
Usage
origins(scope, *, progress=True, cache_force_refresh=False)
Parameters
scope: CrawlScope-
The CrawlScope describing what to crawl.
progress: bool = True-
Whether to display crawl progress, when the backend supports it.
cache_force_refresh: bool = False-
When True, bypass any cached discovery results and re-crawl.
Returns
Iterator[str]- A lazy iterator of source origins, in discovery order. Each origin is unique within a single call.
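The combination of laziness, discovery order, and per-call uniqueness can be pictured as a generator that remembers what it has already yielded. The sketch below is an illustration under that assumption, not the library's code; `unique_in_order` and the sample link stream are hypothetical.

```python
from typing import Iterable, Iterator


def unique_in_order(candidates: Iterable[str]) -> Iterator[str]:
    """Lazily yield each origin the first time it is seen."""
    seen = set()
    for origin in candidates:
        if origin not in seen:
            seen.add(origin)
            yield origin


# Hypothetical link stream in which several pages link to the same targets.
links = iter(["/a", "/b", "/a", "/c", "/b"])
origins = unique_in_order(links)
first = next(origins)  # only one candidate has been consumed at this point
rest = list(origins)   # the remaining unique origins, still in first-seen order
```

Because the `seen` set lives inside one generator call, uniqueness holds within that call only, which matches the "within a single call" wording above.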