crawl.DirectoryCrawler
Crawl local files and optionally cache converted Markdown.
Usage
crawl.DirectoryCrawler()
Use a DirectoryCrawler for local Markdown, notebooks, PDFs, text files, and other formats supported by read_as_markdown(). Directory traversal always reads the current filesystem state. The cache stores converted Markdown per file origin and is reused only when the current file hash and modification time still match the cached metadata. When the cache directory lives inside a crawled root, the crawler skips its own cache files.
Parameters
cache_dir: bool | str | Path | None = None
Where to cache converted Markdown. None (default) disables caching. True uses .raghilda/cache/directory under the current working directory. A string or Path uses that location.
max_workers: int = 1
Number of worker threads used to convert files concurrently in markdown_documents(). Must be at least 1. Default is 1.
Examples
from raghilda.crawl import CrawlScope, DirectoryCrawler

crawler = DirectoryCrawler(cache_dir=True, max_workers=4)
scope = CrawlScope(
    roots=["docs"],
    depth=3,
    include_patterns=[r".*\.(md|qmd|ipynb|pdf)$"],
)
for document in crawler.markdown_documents(scope):
    print(document.origin)
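The caching rules described for this class can be illustrated with a small standard-library sketch. It is not raghilda's implementation: the helper names, the use of SHA-256, the metadata keys, and the treatment of cache_dir=False as "caching off" are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def resolve_cache_dir(cache_dir):
    """Map the documented cache_dir values to a directory, or None when caching is off."""
    # Assumption: False behaves like None (the reference only documents None and True).
    if cache_dir is None or cache_dir is False:
        return None
    if cache_dir is True:
        return Path.cwd() / ".raghilda" / "cache" / "directory"
    return Path(cache_dir)


def file_fingerprint(path):
    """Return (content hash, modification time) for a file.

    SHA-256 and nanosecond mtime are illustrative choices, not confirmed details.
    """
    path = Path(path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest, path.stat().st_mtime_ns


def cache_is_fresh(path, cached_meta):
    """Reuse a cached conversion only if both the hash and the mtime still match."""
    if cached_meta is None:
        return False
    digest, mtime_ns = file_fingerprint(path)
    # The "sha256" and "mtime_ns" keys are hypothetical metadata field names.
    return (
        cached_meta.get("sha256") == digest
        and cached_meta.get("mtime_ns") == mtime_ns
    )
```

Requiring both values to match means that editing a file, or merely touching it, invalidates the cached entry and triggers reconversion.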
Methods
| Name | Description |
|---|---|
| fetch_raw() | Read one local file origin and return its source metadata. |
| markdown_documents() | Discover and convert all local files matched by a scope. |
| origins() | Discover local file origins matched by a scope. |
fetch_raw()
Read one local file origin and return its source metadata.
Usage
fetch_raw(origin, *, cache_force_refresh=False)
The returned FetchedSource points at the file on disk and records its size, modification time, content hash, and detected type label. When caching is enabled and the file is unchanged since the last conversion, the cached Markdown path is attached so that conversion can be skipped.
Parameters
origin: str
A file:// URI (or local path) identifying an existing file.
cache_force_refresh: bool = False
When True, ignore any cached Markdown for this file so that it is reconverted.
Returns
FetchedSource
The source description for the file at origin.
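To make the shape of such a source description concrete, here is a standard-library sketch that accepts either a file:// URI or a local path and records the kinds of fields listed above. SourceInfo and describe_file are hypothetical names and do not reflect the actual FetchedSource class; the SHA-256 hash and the MIME-style type label are likewise assumptions.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


@dataclass(frozen=True)
class SourceInfo:  # hypothetical stand-in, not raghilda's FetchedSource
    origin: str
    path: Path
    size: int
    mtime_ns: int
    sha256: str
    type_label: str


def describe_file(origin):
    """Describe the existing file named by a file:// URI or a local path."""
    if origin.startswith("file://"):
        path = Path(url2pathname(urlparse(origin).path))
    else:
        path = Path(origin)
    path = path.resolve()
    if not path.is_file():
        raise FileNotFoundError(path)
    stat = path.stat()
    # Assumption: the type label is a MIME type guessed from the file name.
    type_label = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return SourceInfo(
        origin=path.as_uri(),
        path=path,
        size=stat.st_size,
        mtime_ns=stat.st_mtime_ns,
        sha256=hashlib.sha256(path.read_bytes()).hexdigest(),
        type_label=type_label,
    )
```

Normalizing every input to a resolved path and a file:// URI gives each file a single canonical origin, which is convenient as a cache key.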
markdown_documents()
Discover and convert all local files matched by a scope.
Usage
markdown_documents(
    scope, *, convert=None, progress=True, cache_force_refresh=False
)
Combines origins() and fetch_markdown(), converting files concurrently using up to max_workers threads while preserving traversal order. When caching is enabled, converted Markdown is reused for files whose hash and modification time are unchanged.
Parameters
scope: CrawlScope
The CrawlScope describing what to crawl.
convert: Callable[[FetchedSource], MarkdownDocument] | None = None
Optional callable that turns a FetchedSource into a MarkdownDocument. When omitted, the default conversion is used.
progress: bool = True
Unused; accepted for interface compatibility.
cache_force_refresh: bool = False
When True, reconvert files even when cached Markdown is present.
Returns
Iterator[MarkdownDocument]
A lazy iterator of converted documents, in traversal order.
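One common way to convert items concurrently while still yielding results in input order is ThreadPoolExecutor.map from the standard library. The sketch below illustrates that general pattern only; convert_in_order is a hypothetical helper, and whether the crawler is built this way is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_in_order(items, convert, max_workers=1):
    """Yield convert(item) for each item, in input order, using a thread pool.

    Because this is a generator, the max_workers check runs on first iteration.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map runs the calls concurrently but yields results
        # in the same order as the inputs, regardless of completion order.
        yield from pool.map(convert, items)
```

Note that Executor.map submits every item up front, so in this sketch only the consumption of results is lazy, not the scheduling of work.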
origins()
Discover local file origins matched by a scope.
Usage
origins(scope, *, progress=True, cache_force_refresh=False)
Walks each root in the scope, descending up to scope.depth directory levels, and yields a file:// URI for every file that passes the scope’s pattern and type filters. Symlinked directories and the crawler’s own cache directory are skipped. Directory traversal always reflects the current filesystem state, so progress and cache_force_refresh have no effect here.
Parameters
scope: CrawlScope
The CrawlScope describing what to crawl. Entries in roots may be directories or individual files.
progress: bool = True
Unused; accepted for interface compatibility.
cache_force_refresh: bool = False
Unused; accepted for interface compatibility.
Returns
Iterator[str]
A lazy iterator of unique file:// origins, in sorted traversal order.
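The traversal described for origins() can be approximated with pathlib. This is a sketch under stated assumptions, not the library's code: iter_file_uris and its parameters are hypothetical, depth is taken to mean the number of directory levels below the root (so depth=0 yields only files directly in the root), and a single regular expression stands in for the scope's pattern and type filters.

```python
import re
from pathlib import Path


def iter_file_uris(root, depth, include_pattern=None, skip_dir=None):
    """Lazily yield file:// URIs under root in sorted order, down to `depth` levels."""
    root = Path(root).resolve()
    skip = Path(skip_dir).resolve() if skip_dir is not None else None
    pattern = re.compile(include_pattern) if include_pattern else None

    def matches(path):
        return pattern is None or pattern.search(path.as_posix()) is not None

    def walk(directory, remaining):
        for entry in sorted(directory.iterdir()):
            if entry.is_dir():
                # Skip symlinked directories and the (hypothetical) cache directory.
                if entry.is_symlink() or entry == skip:
                    continue
                if remaining > 0:
                    yield from walk(entry, remaining - 1)
            elif entry.is_file() and matches(entry):
                yield entry.as_uri()

    if root.is_file():
        # A root may itself be a single file.
        if matches(root):
            yield root.as_uri()
    else:
        yield from walk(root, depth)
```

This sketch handles one root at a time; deduplicating origins across several overlapping roots would need an additional set of already-seen URIs.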