crawl.DirectoryCrawler

Crawl local files and optionally cache the converted Markdown.

Usage

crawl.DirectoryCrawler(
    *,
    cache_dir=None,
    max_workers=1,
)

Use a DirectoryCrawler for local Markdown, notebooks, PDFs, text files, and other formats supported by read_as_markdown(). Directory traversal always reads the current filesystem state. The cache stores converted Markdown per file origin and is reused only when the current file hash and modification time still match the cached metadata. When the cache directory lives inside a crawled root, the crawler skips its own cache files.
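
The cache reuse rule above can be illustrated with a small standard-library sketch. It is not the crawler's actual implementation: the helper names `file_fingerprint` and `cache_is_fresh`, the SHA-256 hash, and the metadata layout are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path: Path) -> dict:
    """Hypothetical helper: hash and modification time of a file."""
    return {
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "mtime_ns": path.stat().st_mtime_ns,
    }


def cache_is_fresh(path: Path, cached_meta: dict) -> bool:
    """Reuse a cache entry only if both hash and mtime still match."""
    return file_fingerprint(path) == cached_meta


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "notes.md"
    source.write_text("# Notes\n", encoding="utf-8")

    # Metadata recorded at conversion time.
    meta = file_fingerprint(source)
    fresh_before = cache_is_fresh(source, meta)

    # Editing the file changes its hash, so the entry is stale.
    source.write_text("# Notes, edited\n", encoding="utf-8")
    fresh_after = cache_is_fresh(source, meta)
```

Requiring both values to match means a file that was edited, or merely touched, is converted again rather than served from a stale cache entry.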

Parameters

cache_dir: bool | str | Path | None = None: Where to cache converted Markdown. None (default) disables caching. True uses .raghilda/cache/directory under the current working directory. A string or Path uses that location.
max_workers: int = 1: Number of worker threads used to convert files concurrently in markdown_documents(). Must be at least 1.
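
As a rough illustration of the cache_dir rules described above, the following sketch maps each accepted value to a location. The function `resolve_cache_dir` is hypothetical and not part of the library; treating False the same as None is an assumption, since the documentation only describes None and True.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_cache_dir(
    cache_dir: Union[bool, str, Path, None],
) -> Optional[Path]:
    """Hypothetical helper mirroring the documented cache_dir rules."""
    if cache_dir is None:
        return None  # caching disabled (the default)
    if cache_dir is True:
        # Documented default location under the current working directory.
        return Path.cwd() / ".raghilda" / "cache" / "directory"
    if cache_dir is False:
        return None  # assumption: False behaves like None
    return Path(cache_dir)  # explicit string or Path location


disabled = resolve_cache_dir(None)
default_location = resolve_cache_dir(True)
custom_location = resolve_cache_dir("build/md-cache")
```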

Examples

from raghilda.crawl import CrawlScope, DirectoryCrawler

crawler = DirectoryCrawler(cache_dir=True, max_workers=4)
scope = CrawlScope(
    roots=["docs"],
    depth=3,
    include_patterns=[r".*\.(md|qmd|ipynb|pdf)$"],
)

for document in crawler.markdown_documents(scope):
    print(document.origin)

Methods

Name	Description
fetch_raw()	Read one local file origin and return its source metadata.
markdown_documents()	Discover and convert all local files matched by a scope.
origins()	Discover local file origins matched by a scope.

fetch_raw()

Read one local file origin and return its source metadata.

Usage

fetch_raw(origin, *, cache_force_refresh=False)

The returned FetchedSource points at the file on disk and records its size, modification time, content hash, and detected type label. When caching is enabled and the file is unchanged since the last conversion, the cached Markdown path is attached so that conversion can be skipped.

Parameters

origin: str: A file:// URI (or local path) identifying an existing file.
cache_force_refresh: bool = False: When True, ignore any cached Markdown for this file so it will be reconverted.

Returns

FetchedSource: The source description for the file at origin.
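
Because origin may be given as a file:// URI or as a local path, it can help to see how the two forms relate. The sketch below uses only the standard library; `to_file_origin` and `origin_to_path` are hypothetical helper names, and the crawler may normalize origins differently.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def to_file_origin(path_or_uri: str) -> str:
    """Return a file:// URI for a local path; pass existing URIs through."""
    if path_or_uri.startswith("file://"):
        return path_or_uri
    # as_uri() requires an absolute path and percent-encodes special characters.
    return Path(path_or_uri).resolve().as_uri()


def origin_to_path(origin: str) -> Path:
    """Convert a file:// URI back into a local filesystem path."""
    parsed = urlparse(origin)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// origin: {origin!r}")
    return Path(url2pathname(parsed.path))


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "my notes.md"
    source.write_text("hello\n", encoding="utf-8")

    origin = to_file_origin(str(source))
    round_trip = origin_to_path(origin)
    expected = source.resolve()
```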

markdown_documents()

Discover and convert all local files matched by a scope.

Usage

markdown_documents(
    scope, *, convert=None, progress=True, cache_force_refresh=False
)

Combines origins() and fetch_markdown(), converting files concurrently using up to max_workers threads while preserving traversal order. When caching is enabled, converted Markdown is reused for files whose hash and modification time are unchanged.

Parameters

scope: CrawlScope: The CrawlScope describing what to crawl.
convert: Callable[[FetchedSource], MarkdownDocument] | None = None: Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the default conversion is used.
progress: bool = True: Unused; accepted for interface compatibility.
cache_force_refresh: bool = False: When True, reconvert files even when cached Markdown is present.

Returns

Iterator[MarkdownDocument]: A lazy iterator of converted documents, in traversal order.
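
One common way to convert items concurrently while keeping their original order is `ThreadPoolExecutor.map`. The sketch below shows that pattern under stated assumptions: `convert_in_order` and `fake_convert` are hypothetical stand-ins, plain strings replace FetchedSource and MarkdownDocument, and nothing here claims to be how the crawler itself is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def convert_in_order(
    sources: Iterable[str],
    convert: Callable[[str], str],
    max_workers: int = 1,
) -> Iterator[str]:
    """Convert sources on worker threads, yielding results in input order."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits every source up front and yields results in input
        # order, even when a later item finishes first.
        yield from pool.map(convert, sources)


# Stand-in converter: the first file is the slowest to "convert".
DELAYS = {"a.md": 0.06, "b.md": 0.03, "c.md": 0.0}


def fake_convert(name: str) -> str:
    time.sleep(DELAYS[name])
    return name.upper()


results = list(convert_in_order(DELAYS, fake_convert, max_workers=3))
```

Even though c.md finishes first, the results come back in the order the sources were supplied.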

origins()

Discover local file origins matched by a scope.

Usage

origins(scope, *, progress=True, cache_force_refresh=False)

Walks each root in the scope, descending up to scope.depth directory levels, and yields a file:// URI for every file that passes the scope’s pattern and type filters. Symlinked directories and the crawler’s own cache directory are skipped. Directory traversal always reflects the current filesystem state, so progress and cache_force_refresh have no effect here.

Parameters

scope: CrawlScope: The CrawlScope describing what to crawl. Its roots may be directories or individual files.
progress: bool = True: Unused; accepted for interface compatibility.
cache_force_refresh: bool = False: Unused; accepted for interface compatibility.

Returns

Iterator[str]: A lazy iterator of unique file:// origins, in sorted traversal order.
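
A depth-limited, sorted traversal of the kind described above might look like the following sketch. It is only an illustration: `walk_origins` is a hypothetical function, pattern and type filters are omitted, and the depth convention (0 means only files directly inside the root) is an assumption rather than documented behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def walk_origins(root: Path, depth: int) -> Iterator[str]:
    """Yield file:// URIs under root, descending at most `depth` levels."""
    root = root.resolve()
    if root.is_file():
        yield root.as_uri()  # a root may be an individual file
        return

    def visit(directory: Path, level: int) -> Iterator[str]:
        for entry in sorted(directory.iterdir()):
            if entry.is_symlink() and entry.is_dir():
                continue  # skip symlinked directories
            if entry.is_dir():
                if level < depth:
                    yield from visit(entry, level + 1)
            elif entry.is_file():
                yield entry.as_uri()

    yield from visit(root, 0)


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    (docs / "sub" / "deep").mkdir(parents=True)
    (docs / "a.md").write_text("a", encoding="utf-8")
    (docs / "sub" / "b.md").write_text("b", encoding="utf-8")
    (docs / "sub" / "deep" / "c.md").write_text("c", encoding="utf-8")

    # Keep only the file names so the demo output is easy to compare.
    depth_0 = [u.rsplit("/", 1)[-1] for u in walk_origins(docs, depth=0)]
    depth_1 = [u.rsplit("/", 1)[-1] for u in walk_origins(docs, depth=1)]
    depth_2 = [u.rsplit("/", 1)[-1] for u in walk_origins(docs, depth=2)]
    single = list(walk_origins(docs / "a.md", depth=0))
```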