import os

from raghilda.crawl import CloudflareCrawler, CrawlScope

# Credentials are read from the environment rather than hard-coded.
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    cache_dir=True,  # cache under .raghilda/cache/cloudflare
    render=True,  # execute JavaScript before extracting content
)
scope = CrawlScope(
    roots=["https://example.com/docs/"],
    depth=2,
    include_patterns=["https://example.com/docs/**"],
)
for document in crawler.markdown_documents(scope):
    print(document.origin)

crawl.CloudflareCrawler
Crawl a website using Cloudflare’s Browser Rendering API.
Usage
crawl.CloudflareCrawler()
A CloudflareCrawler delegates page discovery, JavaScript rendering, and Markdown extraction to Cloudflare. It submits a crawl job, polls until the job completes, retrieves the rendered Markdown record for each discovered page, and yields them as MarkdownDocument objects. This is useful for single-page applications and other sites whose content only appears after client-side rendering.
For Cloudflare crawls, include_patterns and exclude_patterns accept the same inputs as the other crawlers: glob strings (such as "https://example.com/docs/**") or compiled re.Pattern objects. Glob strings are forwarded to Cloudflare’s crawl request; regex patterns are enforced locally on the returned records. include_external_links and include_subdomains are passed through to the crawl request. Cached entries are invalidated automatically when render, source, or modified_since changes between runs.
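The split between forwarded globs and locally enforced regexes can be pictured with a small standalone sketch. It does not use raghilda; `split_patterns` and `filter_locally` are hypothetical helper names, and the "keep a URL if any include regex matches" rule is an assumption made for illustration rather than the library's documented behavior.

```python
import re
from typing import Iterable, Union

PatternLike = Union[str, re.Pattern]


def split_patterns(patterns: Iterable[PatternLike]) -> tuple[list[str], list[re.Pattern]]:
    """Separate glob strings (sent with the request) from regexes (applied locally)."""
    globs: list[str] = []
    regexes: list[re.Pattern] = []
    for pattern in patterns:
        if isinstance(pattern, re.Pattern):
            regexes.append(pattern)
        else:
            globs.append(pattern)
    return globs, regexes


def filter_locally(urls: Iterable[str], include_regexes: list[re.Pattern]) -> list[str]:
    """Keep URLs matching at least one include regex (assumed rule); keep all if none given."""
    if not include_regexes:
        return list(urls)
    return [url for url in urls if any(rx.search(url) for rx in include_regexes)]


globs, regexes = split_patterns(
    ["https://example.com/docs/**", re.compile(r"/docs/v\d+/")]
)
# Pretend these URLs came back from the remote crawl.
returned = [
    "https://example.com/docs/v2/intro",
    "https://example.com/docs/changelog",
]
kept = filter_locally(returned, regexes)
print(globs)  # the glob would travel with the crawl request
print(kept)   # the regex is applied to the returned records
```

One consequence of this division is that a regex cannot reduce the number of pages Cloudflare crawls; it only trims what is kept afterwards, so a glob is likely the better choice when the goal is to limit API usage.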
Parameters
account_id: str
The Cloudflare account ID that owns the Browser Rendering subscription.
api_token: str
A Cloudflare API token with Browser Rendering permissions. Sent as a bearer token on each request.
cache_dir: bool | str | Path | None = None
Where to cache crawl results and page records. None (default) uses a temporary directory. True uses .raghilda/cache/cloudflare under the current working directory. A string or Path uses that location.
session: requests.Session | Any | None = None
A requests.Session to use for API calls. When omitted, a new session is created.
source: str = "all"
How Cloudflare discovers pages: "all" (default) combines available methods, "sitemap" reads the site’s sitemap, and "crawl" follows links from the rendered DOM.
render: bool = True
Whether Cloudflare executes JavaScript before extracting content. Defaults to True. Set to False for server-rendered sites to reduce crawl time and API usage.
cache_stale_after: timedelta | None = None
How long cached results stay fresh. When stale, the crawler requests updated content with a maxAge hint. When None (default), cached entries never expire.
modified_since: int | None = None
Restrict the crawl to pages modified after this Unix timestamp, for incremental refreshes. When None (default), all pages are eligible.
poll_interval: float = 5.0
Seconds to wait between job-status polls. Default is 5.0.
max_poll_attempts: int = 60
Maximum number of status polls before a TimeoutError is raised. Default is 60 (a five-minute window at the default interval).
max_workers: int = 1
Number of worker threads used to materialize page records concurrently. Must be at least 1. Default is 1.
base_url: str = "https://api.cloudflare.com/client/v4"
Base URL of the Cloudflare API. Defaults to the public endpoint and is rarely overridden outside of testing.
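To make the interaction between poll_interval and max_poll_attempts concrete, here is a minimal standalone polling loop. It is a sketch and not the crawler's implementation: `wait_for_job`, the `get_status` callable, and the "completed" status string are all hypothetical, and the sleep function is injected so the demo runs without waiting.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    max_poll_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the job reports "completed"; return the number of polls used."""
    for attempt in range(1, max_poll_attempts + 1):
        if get_status() == "completed":  # status value is an assumption
            return attempt
        sleep(poll_interval)
    waited = poll_interval * max_poll_attempts
    raise TimeoutError(f"job not complete after {max_poll_attempts} polls ({waited:.0f}s)")


# Simulated job that finishes on the third poll; waits are recorded, not slept.
statuses = iter(["running", "running", "completed"])
waits: list[float] = []
attempts = wait_for_job(lambda: next(statuses), sleep=waits.append)
print(attempts, waits)
```

With the defaults, the loop gives up after 60 polls spaced 5 seconds apart, which is the five-minute window mentioned above. Large sites may need a higher max_poll_attempts or a longer poll_interval.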
Examples
Methods
| Name | Description |
|---|---|
| fetch_raw() | Fetch the rendered Markdown record for one Cloudflare origin. |
| origins() | Discover origins by running a Cloudflare crawl for each root. |
fetch_raw()
Fetch the rendered Markdown record for one Cloudflare origin.
Usage
fetch_raw(origin, *, cache_force_refresh=False)
Returns a fresh in-memory or on-disk cached record when available. Otherwise it runs a single-URL Cloudflare crawl to produce the record. The returned source’s body path points at the rendered Markdown, which is already suitable for conversion without further processing.
Parameters
origin: str
The URL to fetch. It is canonicalized before use.
cache_force_refresh: bool = False
When True, ignore cached records and request a fresh crawl.
Returns
FetchedSource
The fetched source for origin, with its rendered Markdown body path and metadata.
Raises
ValueError
If Cloudflare does not return a record for origin.
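The cache-then-crawl behavior described for fetch_raw() follows a common pattern, sketched below in standalone form. Nothing here is raghilda's code: `CachedRecord`, `fetch_with_cache`, and the `crawl_one` callable are hypothetical, the cache is a plain dict, and a stale entry is simply re-crawled instead of being refreshed with a maxAge hint.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class CachedRecord:
    markdown: str
    fetched_at: datetime


def fetch_with_cache(
    origin: str,
    cache: dict[str, CachedRecord],
    crawl_one: Callable[[str], Optional[str]],
    *,
    stale_after: Optional[timedelta] = None,
    force_refresh: bool = False,
    now: Optional[datetime] = None,
) -> CachedRecord:
    """Return a fresh cached record, or crawl the single URL and cache the result."""
    if now is None:
        now = datetime.now(timezone.utc)
    cached = cache.get(origin)
    if cached is not None and not force_refresh:
        # With stale_after=None, cached entries never expire.
        if stale_after is None or now - cached.fetched_at < stale_after:
            return cached
    markdown = crawl_one(origin)
    if markdown is None:
        raise ValueError(f"no record returned for {origin}")
    record = CachedRecord(markdown=markdown, fetched_at=now)
    cache[origin] = record
    return record


calls: list[str] = []


def fake_crawl(origin: str) -> Optional[str]:
    """Stand-in for the remote crawl; records each call."""
    calls.append(origin)
    return "# Title\n"


cache: dict[str, CachedRecord] = {}
first = fetch_with_cache("https://example.com/docs/", cache, fake_crawl)
second = fetch_with_cache("https://example.com/docs/", cache, fake_crawl)  # cache hit
third = fetch_with_cache("https://example.com/docs/", cache, fake_crawl, force_refresh=True)
print(len(calls))  # number of simulated remote crawls
```

The second call is served from the cache, while the forced refresh triggers a new crawl, mirroring the role of cache_force_refresh.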
origins()
Discover origins by running a Cloudflare crawl for each root.
Usage
origins(scope, *, progress=True, cache_force_refresh=False)
Submits a crawl job for each root in the scope and yields the canonical URL of every completed page record that passes the scope’s pattern and type filters. Results are cached per root so repeated runs can avoid new Cloudflare API calls while the cache is fresh.
Parameters
scope: CrawlScope
The CrawlScope describing what to crawl. roots must be http or https URLs.
progress: bool = True
Unused; accepted for interface compatibility.
cache_force_refresh: bool = False
When True, submit a fresh crawl job instead of reusing cached results.
Returns
Iterator[str]
A lazy iterator of unique canonical URLs returned by Cloudflare.
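A lazy iterator of unique canonical URLs can be sketched with a generator and a seen-set. This is an illustration only: `canonicalize` and `unique_origins` are hypothetical names, and the canonicalization rule used here (lowercase host, drop the fragment, default an empty path to "/") is an assumption, not the rule raghilda applies.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Assumed rule: lowercase scheme and host, drop the fragment, default path to "/"."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def unique_origins(urls: Iterable[str]) -> Iterator[str]:
    """Yield each canonical URL once, in first-seen order, without building a list."""
    seen: set[str] = set()
    for url in urls:
        canonical = canonicalize(url)
        if canonical not in seen:
            seen.add(canonical)
            yield canonical


# Pretend these URLs came back across the crawl jobs for two overlapping roots.
returned = [
    "https://Example.com/docs/#intro",
    "https://example.com/docs/",
    "https://example.com/docs/api",
]
origins = list(unique_origins(returned))
print(origins)
```

Because the generator yields as it goes, a caller can begin fetching the first origins while later records are still being examined.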