crawl.FetchedSource

A fetched source document and its metadata, prior to conversion.

Usage

crawl.FetchedSource(
    origin,
    body_path,
    resolved_origin=None,
    content_type=None,
    status_code=None,
    metadata=None,
    fetched_at=None,
    revalidated_at=None,
    markdown_path=None
)

A FetchedSource is the intermediate result returned by a crawler’s fetch_raw(). It points at the raw body on disk and carries the metadata needed to convert it to a MarkdownDocument. Custom convert callables passed to fetch_markdown() or markdown_documents() receive an instance of this class.
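
As an illustration of what a custom convert callable might look like, here is a minimal sketch. It does not import the library: FetchedSourceStandIn is a hypothetical dataclass that only mirrors the fields documented on this page, and convert_plain_text is an invented name. The sketch also assumes a convert callable may return the Markdown as a plain string; check the fetch_markdown() documentation for the actual expected return type.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any


@dataclass
class FetchedSourceStandIn:
    """Hypothetical stand-in mirroring the documented fields (not the real class)."""

    origin: str
    body_path: Path
    resolved_origin: str | None = None
    content_type: str | None = None
    status_code: int | None = None
    metadata: dict[str, Any] | None = None
    fetched_at: datetime | None = None
    revalidated_at: datetime | None = None
    markdown_path: Path | None = None


def convert_plain_text(source) -> str:
    """Hypothetical convert callable; assumes a Markdown string is an acceptable result."""
    # Reuse Markdown the backend already produced, if any.
    if source.markdown_path is not None and source.markdown_path.exists():
        return source.markdown_path.read_text(encoding="utf-8")
    # Otherwise read the raw body from disk and put the final origin in a heading.
    body = source.body_path.read_text(encoding="utf-8", errors="replace")
    final_origin = source.resolved_origin or source.origin
    return f"# {final_origin}\n\n{body}"


# Usage: build a stand-in around a temporary local file and convert it.
with tempfile.TemporaryDirectory() as tmp:
    body_file = Path(tmp) / "note.txt"
    body_file.write_text("hello", encoding="utf-8")
    demo = FetchedSourceStandIn(origin=body_file.as_uri(), body_path=body_file)
    markdown = convert_plain_text(demo)
```

The callable only touches attributes, so it would work unchanged with any object exposing body_path, markdown_path, origin, and resolved_origin.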

Parameter Attributes

origin: str

The canonical origin the source was requested from (a file:// URI for local files, an http(s) URL for web sources).

body_path: Path

Filesystem path to the raw fetched body. For local files this is the file itself; for web and Cloudflare sources it is a cached copy.

resolved_origin: str | None = None

The final origin after any redirects, when it differs from origin.

content_type: str | None = None

The reported MIME type, such as "text/html", when available.

status_code: int | None = None

The HTTP status code for web sources, when available.

metadata: dict[str, Any] | None = None

Backend-specific metadata, such as the detected type_label, validators (etag, last_modified), and source hashes.
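
Because metadata may be None and its keys vary by backend, reading it defensively is usually safest. The sketch below turns stored validators into standard HTTP conditional request headers. It assumes etag and last_modified are top-level string keys of the dict and that last_modified is already in HTTP-date format; the actual layout is backend-specific, and conditional_headers is a hypothetical helper, not part of the library.

```python
from __future__ import annotations

from typing import Any


def conditional_headers(metadata: dict[str, Any] | None) -> dict[str, str]:
    """Build If-None-Match / If-Modified-Since headers from stored validators.

    Assumption: validators live at the top level of the metadata dict.
    """
    headers: dict[str, str] = {}
    if not metadata:
        # metadata is optional, so None or an empty dict yields no headers.
        return headers
    etag = metadata.get("etag")
    if etag:
        headers["If-None-Match"] = str(etag)
    last_modified = metadata.get("last_modified")
    if last_modified:
        headers["If-Modified-Since"] = str(last_modified)
    return headers
```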

fetched_at: datetime | None = None

When the source body was fetched, when known.

revalidated_at: datetime | None = None

When a cached body was last revalidated against the server, when known.
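
One way the two timestamps might be combined is to treat the most recent check (revalidation if present, otherwise the original fetch) as the freshness reference. The helpers below are a hypothetical sketch, not library functions; they assume both timestamps are either timezone-aware or naive in the same way as the now value passed in, and they use a SimpleNamespace in place of a real FetchedSource.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def last_checked(source) -> datetime | None:
    """Return revalidated_at when set, otherwise fetched_at (may be None)."""
    if source.revalidated_at is not None:
        return source.revalidated_at
    return source.fetched_at


def is_stale(source, max_age: timedelta, now: datetime) -> bool:
    """Treat a source with no known timestamp as stale."""
    checked = last_checked(source)
    if checked is None:
        return True
    return now - checked > max_age


# Usage with a stand-in object carrying only the two timestamp attributes.
fetched_only = SimpleNamespace(
    fetched_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    revalidated_at=None,
)
now = datetime(2024, 1, 3, tzinfo=timezone.utc)
stale = is_stale(fetched_only, max_age=timedelta(days=1), now=now)
```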

markdown_path: Path | None = None

Filesystem path to already-converted Markdown, when the backend produced or cached it. None when conversion has not run.