crawl.CloudflareCrawler

Crawl a website using Cloudflare’s Browser Rendering API.

Usage

crawl.CloudflareCrawler(
    *,
    account_id,
    api_token,
    cache_dir=None,
    session=None,
    source="all",
    render=True,
    cache_stale_after=None,
    modified_since=None,
    poll_interval=5.0,
    max_poll_attempts=60,
    max_workers=1,
    base_url="https://api.cloudflare.com/client/v4"
)

A CloudflareCrawler delegates page discovery, JavaScript rendering, and Markdown extraction to Cloudflare. It submits a crawl job, polls until the job completes, retrieves the rendered Markdown record for each discovered page, and yields them as MarkdownDocument objects. This is useful for single-page applications and other sites whose content only appears after client-side rendering.

For Cloudflare crawls, include_patterns and exclude_patterns accept the same glob strings (such as "https://example.com/docs/**") or compiled re.Pattern objects as the other crawlers. Glob strings are forwarded to Cloudflare’s crawl request; regex patterns are enforced locally on the returned records. include_external_links / include_subdomains are passed through to the crawl request. Cached entries are invalidated automatically when render, source, or modified_since change between runs.

Parameters

account_id: str: The Cloudflare account ID that owns the Browser Rendering subscription.
api_token: str: A Cloudflare API token with Browser Rendering permissions. Sent as a bearer token on each request.
cache_dir: bool | str | Path | None = None: Where to cache crawl results and page records. None (default) uses a temporary directory. True uses .raghilda/cache/cloudflare under the current working directory. A string or Path uses that location.
session: requests.Session | Any | None = None: A requests.Session to use for API calls. When omitted, a new session is created.
source: str = "all": How Cloudflare discovers pages: "all" (default) combines available methods, "sitemap" reads the site’s sitemap, and "crawl" follows links from the rendered DOM.
render: bool = True: Whether Cloudflare executes JavaScript before extracting content. Defaults to True. Set to False for server-rendered sites to reduce crawl time and API usage.
cache_stale_after: timedelta | None = None: How long cached results stay fresh. When stale, the crawler requests updated content with a maxAge hint. When None (default), cached entries never expire.
modified_since: int | None = None: Restrict the crawl to pages modified after this Unix timestamp, for incremental refreshes. When None (default), all pages are eligible.
poll_interval: float = 5.0: Seconds to wait between job-status polls. Default is 5.0.
max_poll_attempts: int = 60: Maximum number of status polls before a TimeoutError is raised. Default is 60 (a five-minute window at the default interval).
max_workers: int = 1: Number of worker threads used to materialize page records concurrently. Must be at least 1. Default is 1.
base_url: str = "https://api.cloudflare.com/client/v4": Base URL of the Cloudflare API. Defaults to the public endpoint and is rarely overridden outside of testing.

Examples

import os

from raghilda.crawl import CloudflareCrawler, CrawlScope

crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    cache_dir=True,
    render=True,
)
scope = CrawlScope(
    roots=["https://example.com/docs/"],
    depth=2,
    include_patterns=["https://example.com/docs/**"],
)

for document in crawler.markdown_documents(scope):
    print(document.origin)

Methods

Name	Description
fetch_raw()	Fetch the rendered Markdown record for one Cloudflare origin.
origins()	Discover origins by running a Cloudflare crawl for each root.

fetch_raw()

Fetch the rendered Markdown record for one Cloudflare origin.

Usage

Source

fetch_raw(origin, *, cache_force_refresh=False)

Returns a fresh in-memory or on-disk cached record when available. Otherwise it runs a single-URL Cloudflare crawl to produce the record. The returned source’s body path points at the rendered Markdown, which is already suitable for conversion without further processing.

Parameters

origin: str: The URL to fetch. It is canonicalized before use.
cache_force_refresh: bool = False: When True, ignore cached records and request a fresh crawl.

Returns

FetchedSource: The fetched source for origin, with its rendered Markdown body path and metadata.

Raises

ValueError: If Cloudflare does not return a record for origin.

origins()

Discover origins by running a Cloudflare crawl for each root.

Usage

Source

origins(scope, *, progress=True, cache_force_refresh=False)

Submits a crawl job for each root in the scope and yields the canonical URL of every completed page record that passes the scope’s pattern and type filters. Results are cached per root so repeated runs can avoid new Cloudflare API calls while the cache is fresh.

Parameters

scope: CrawlScope: The CrawlScope describing what to crawl. roots must be http or https URLs.
progress: bool = True: Unused; accepted for interface compatibility.
cache_force_refresh: bool = False: When True, submit a fresh crawl job instead of reusing cached results.

Returns

Iterator[str]: A lazy iterator of unique canonical URLs returned by Cloudflare.