crawl.FetchedSource

A fetched source document and its metadata, prior to conversion.

Usage

crawl.FetchedSource(
    origin,
    body_path,
    resolved_origin=None,
    content_type=None,
    status_code=None,
    metadata=None,
    fetched_at=None,
    revalidated_at=None,
    markdown_path=None
)

A FetchedSource is the intermediate result returned by a crawler’s fetch_raw(). It points at the raw body on disk and carries the metadata needed to convert it to a MarkdownDocument. Custom convert callables passed to fetch_markdown() or markdown_documents() receive an instance of this class.
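As a rough illustration of the custom convert callable mentioned above, the sketch below reads the raw body from body_path and returns Markdown text. It makes several assumptions not confirmed by this page: that a convert callable takes the source as its only argument and returns a string, and that plain text is a content type worth handling this way. FetchedSourceStandIn and convert_plain_text are hypothetical names; the stand-in only mirrors the documented fields so the example runs without the library.

```python
import tempfile
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass
class FetchedSourceStandIn:
    """Hypothetical stand-in mirroring the documented fields (not the real class)."""

    origin: str
    body_path: Path
    resolved_origin: Optional[str] = None
    content_type: Optional[str] = None
    status_code: Optional[int] = None
    metadata: Optional[Dict[str, Any]] = None
    fetched_at: Optional[datetime] = None
    revalidated_at: Optional[datetime] = None
    markdown_path: Optional[Path] = None


def convert_plain_text(source) -> str:
    """Hypothetical convert callable: turn a plain-text body into Markdown.

    Assumption: the callable receives the source and returns Markdown text.
    """
    if source.content_type not in (None, "text/plain"):
        raise ValueError(f"unsupported content type: {source.content_type}")
    # The raw body lives on disk, so read it from body_path.
    text = source.body_path.read_text(encoding="utf-8")
    # Prefer the post-redirect origin for the title when one was recorded.
    title = source.resolved_origin or source.origin
    return f"# {title}\n\n{text.strip()}\n"


with tempfile.TemporaryDirectory() as tmp:
    body = Path(tmp) / "note.txt"
    body.write_text("hello world\n", encoding="utf-8")
    source = FetchedSourceStandIn(
        origin=body.as_uri(),  # file:// URI, as documented for local files
        body_path=body,
        content_type="text/plain",
    )
    markdown = convert_plain_text(source)

print(markdown)
```

With the real library, a callable shaped like this would presumably be passed as the convert argument to fetch_markdown() or markdown_documents(); check those functions' own documentation for the exact signature and return type they expect.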

Parameters

origin: str: The canonical origin the source was requested from (a file:// URI for local files, an http(s) URL for web sources).
body_path: Path: Filesystem path to the raw fetched body. For local files this is the file itself; for web and Cloudflare sources it is a cached copy.
resolved_origin: str | None = None: The final origin after any redirects, when it differs from origin.
content_type: str | None = None: The reported MIME type, such as "text/html", when available.
status_code: int | None = None: The HTTP status code for web sources, when available.
metadata: dict[str, Any] | None = None: Backend-specific metadata, such as the detected type_label, validators (etag, last_modified), and source hashes.
fetched_at: datetime | None = None: The time the source body was fetched, if known.
revalidated_at: datetime | None = None: The time a cached body was last revalidated against the server, if known.
markdown_path: Path | None = None: Filesystem path to already-converted Markdown, when the backend produced or cached it. None when conversion has not run.
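Because most of these fields are optional, code that consumes a source usually needs fallbacks. The helpers below are a minimal sketch of that pattern, not part of the library: the function names are hypothetical, a SimpleNamespace stands in for a real instance, and the layout of metadata (etag and last_modified as top-level keys) is an assumption, since the exact structure is backend-specific.

```python
from datetime import datetime, timezone
from pathlib import Path
from types import SimpleNamespace


def effective_origin(source) -> str:
    """Return the post-redirect origin if recorded, else the requested origin."""
    return source.resolved_origin or source.origin


def cache_validators(source) -> dict:
    """Collect HTTP validators from metadata.

    Assumption: validators are stored as top-level "etag" and
    "last_modified" keys; the real layout may differ per backend.
    """
    meta = source.metadata or {}
    return {key: meta[key] for key in ("etag", "last_modified") if key in meta}


def last_checked(source):
    """Most recent time the body is known to be current, or None if unknown."""
    return source.revalidated_at or source.fetched_at


def needs_conversion(source) -> bool:
    """True when no converted Markdown has been produced or cached yet."""
    return source.markdown_path is None


# Stand-in object carrying the documented attributes (not the real class).
source = SimpleNamespace(
    origin="http://example.com/docs",
    body_path=Path("cache/body.html"),
    resolved_origin="https://example.com/docs/",
    content_type="text/html",
    status_code=200,
    metadata={"etag": '"abc123"', "type_label": "html"},
    fetched_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    revalidated_at=None,
    markdown_path=None,
)

print(effective_origin(source))
print(cache_validators(source))
print(last_checked(source).isoformat())
print(needs_conversion(source))
```

Keeping the fallbacks in small functions like these means a missing resolved_origin, metadata, or revalidated_at is handled in one place instead of at every call site.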