Cloudflare Browser Rendering Crawler
The CloudflareCrawler delegates all page fetching and rendering to Cloudflare’s Browser Rendering API. If you have a Cloudflare account, this has several benefits:
- long-running crawling tasks, and optionally Markdown conversion, run on Cloudflare’s servers instead of the local host
- websites that load their content entirely through JavaScript (which generally makes text retrieval problematic) are rendered by Cloudflare
- you don’t have to run a headless browser locally; you receive ready-to-chunk text from the paid service
This guide covers how to set up and use CloudflareCrawler to build a store. It assumes some familiarity with the general crawling workflow described in Crawling and Ingestion.
Prerequisites
CloudflareCrawler requires two credentials from a Cloudflare account that has the Browser Rendering API enabled:
- Account ID: found on your Cloudflare dashboard under the account overview page
- API Token: a bearer token with permission to access the Browser Rendering API
Store both as environment variables rather than hardcoding them in scripts, and read them at runtime:
import os
account_id = os.environ["CLOUDFLARE_ACCOUNT_ID"]
api_token = os.environ["CLOUDFLARE_API_TOKEN"]
If the credentials are missing or the account does not have Browser Rendering enabled, the crawler will raise an error on the first API call.
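To surface a missing credential before any API call is made, one option is to validate the environment up front. The sketch below is a hypothetical helper (load_cloudflare_credentials is not part of raghilda); it assumes only the two environment variable names used in this guide.

```python
import os


def load_cloudflare_credentials() -> tuple[str, str]:
    """Return (account_id, api_token), failing early with a readable message.

    Hypothetical helper for illustration; not part of raghilda.
    """
    names = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variable(s): " + ", ".join(missing)
        )
    return os.environ[names[0]], os.environ[names[1]]
```

The returned pair can then be passed as account_id= and api_token= when constructing the crawler.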
Basic usage
A complete crawl-to-store pipeline needs four pieces:
- a store to write into
- a crawler to fetch and render pages
- a scope describing which pages to include
- a chunker to split the rendered Markdown into retrieval-sized pieces
You also need to decide on an embedding provider for the store (or defer that by passing embed=None if you plan to add embeddings later).
The example below uses the crawler defaults for everything beyond the required credentials. That means:
- browser rendering is on
- all discovery methods are active
- no caching is configured
- no filtering patterns are applied
This is a reasonable starting point for an initial exploration of a site before tuning scope and caching for production use.
import os
from raghilda.chunker import MarkdownChunker
from raghilda.crawl import CloudflareCrawler, CrawlScope
from raghilda.embedding import EmbeddingOpenAI
from raghilda.store import DuckDBStore
# 1. Create a store with an embedding provider
store = DuckDBStore.create(
    location="rendered_docs.db",
    embed=EmbeddingOpenAI(),
    name="rendered_docs",
    title="Rendered Documentation",
    overwrite=True,
)
# 2. Set up the crawler with Cloudflare credentials
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
)
# 3. Define the crawl scope
scope = CrawlScope(
    roots=["https://example.com/docs/"],
    depth=2,
)
# 4. Chunk and ingest the crawled pages
chunker = MarkdownChunker()
summary = store.ingest(
    crawler.markdown_documents(scope),
    prepare=chunker.chunk,
    max_workers=4,
)
# 5. Build retrieval indexes
store.build_index()
print(summary)
# Output: IngestSummary(inserted=47, replaced=0, skipped=0)
This posts a crawl job to Cloudflare, polls until it completes, retrieves the rendered Markdown for each discovered page, chunks it, and writes everything to the store. The depth=2 setting tells Cloudflare to follow links up to two levels from the root URL.
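To make the depth setting concrete, here is a small sketch of depth-limited link following over a toy in-memory link graph. It is not how Cloudflare implements discovery; LINKS and discover are hypothetical names used only for illustration.

```python
from collections import deque

# Toy link graph: page -> pages it links to (hypothetical data).
LINKS = {
    "/docs/": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/1"],
    "/docs/b": [],
    "/docs/a/1": ["/docs/a/1/x"],
    "/docs/a/1/x": [],
}


def discover(root: str, depth: int) -> list[str]:
    """Breadth-first traversal that stops `depth` links away from root."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # do not follow links beyond the depth limit
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order


# With depth=2, "/docs/a/1/x" (three links from the root) is never reached.
print(discover("/docs/", depth=2))
```

In this toy graph, depth=0 returns only the root, which matches the idea of processing the given URLs without following links.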
The store.build_index() call at the end creates an HNSW vector index and a BM25 keyword index over the stored chunks. These indexes make subsequent retrieve() calls fast. Building them after all documents are ingested is more efficient than updating them incrementally during writes, which is why it appears as a separate step at the end.
The printed IngestSummary reports how many documents were newly added, how many had their content replaced, and how many were skipped because identical content was already in the store.
How it differs from WebCrawler
WebCrawler fetches raw HTML with requests and converts it to Markdown locally. CloudflareCrawler offloads both the fetching and the Markdown conversion to Cloudflare’s infrastructure. The practical differences are:
- JavaScript rendering: CloudflareCrawler executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration.
- No local browser required: you do not need Playwright, Selenium, or any headless browser installed.
- Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM.
- Markdown can arrive pre-converted: the API can return Markdown directly, with HTML-to-Markdown conversion performed by Cloudflare. This is optional: you can still choose to receive raw HTML and then perform HTML-to-Markdown conversion locally using Raghilda’s built-in converter or another converter of your choice.
The tradeoff is that CloudflareCrawler requires a Cloudflare account with Browser Rendering access and incurs API usage costs. For static HTML sites, WebCrawler is simpler (and free).
Browser rendering
The render= parameter controls whether Cloudflare executes JavaScript before extracting page content. It defaults to True, which is the right choice for any site that relies on client-side rendering. When enabled, Cloudflare loads each page in a headless browser, waits for scripts to finish, and extracts text from the fully populated DOM.
Setting it explicitly for clarity:
# JavaScript is executed (default)
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    render=True,
)
Set render=False if the target site is server-rendered HTML and does not need JavaScript execution. This reduces crawl time and Cloudflare API usage:
# Skip JavaScript execution for static sites
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    render=False,
)
If you are unsure whether a site needs rendering, start with the default (render=True), then fetch a few of the same pages with render=False and compare. If the unrendered Markdown already contains the expected content, switching to render=False will speed up the crawl and reduce your Cloudflare API usage without losing any text.
Page discovery with the source parameter
The source= parameter controls how Cloudflare discovers pages on the target site. The default is "all", which combines multiple discovery methods:
# Use all available discovery methods (default)
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    source="all",
)
Other options for source= are:
- "sitemap": only discover pages listed in the site’s sitemap.xml. This is efficient for well-maintained sites where the sitemap is comprehensive.
- "crawl": follow links from the rendered DOM, similar to traditional web crawling but with JavaScript execution.
- "urls": only process the explicitly provided root URLs, without following any links.
For a documentation site with a complete sitemap, "sitemap" is typically the fastest option because it avoids rendering every page just to find links:
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    source="sitemap",
)
scope = CrawlScope(
    roots=["https://example.com/docs/"],
    depth=0,
)
With source="sitemap" and depth=0, Cloudflare reads the sitemap from the root and returns all listed pages without further link-following.
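For readers unfamiliar with sitemaps, the sketch below shows what "pages listed in the sitemap" means in practice: a sitemap.xml is a flat list of loc entries that can be read without rendering any page. This is a local illustration using the standard library, not Cloudflare's implementation; SAMPLE and sitemap_urls are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/install</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(sitemap_urls(SAMPLE))
```

Because the URL list comes from a single XML document, discovery cost does not grow with the number of links on each page.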
Filtering with glob patterns
The include_patterns= and exclude_patterns= fields on CrawlScope use the same glob syntax for every crawler, so a scope you tuned against WebCrawler behaves identically with CloudflareCrawler. A single * matches any characters except / (one path segment), and ** matches any characters including / (any number of segments). Patterns are matched against the full URL:
scope = CrawlScope(
    roots=["https://example.com/"],
    depth=2,
    include_patterns=["https://example.com/docs/**"],
    exclude_patterns=[
        "https://example.com/docs/archive/**",
        "https://example.com/docs/internal/**",
    ],
    limit=500,
)
The include_patterns list restricts the crawl to URLs matching at least one pattern. The exclude_patterns list removes URLs that match any pattern, and always takes priority over include_patterns. The limit= field caps the total number of pages returned.
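The wildcard and priority rules can be sketched in a few lines of standard-library Python. This is an illustration of the rules as described here, not raghilda's or Cloudflare's actual matcher. It assumes that a pattern must match the whole URL, that every character other than * is literal, and that an empty include list allows everything; glob_to_regex and url_allowed are hypothetical names.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a */** glob into a compiled regex (illustrative only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # ** crosses path segments
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # * stays within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply exclude first (it always wins), then include."""
    if any(glob_to_regex(p).fullmatch(url) for p in exclude):
        return False
    if not include:
        return True  # assumption: no include patterns means no restriction
    return any(glob_to_regex(p).fullmatch(url) for p in include)


include = ["https://example.com/docs/**"]
exclude = ["https://example.com/docs/archive/**"]
print(url_allowed("https://example.com/docs/intro", include, exclude))
print(url_allowed("https://example.com/docs/archive/2019/old", include, exclude))
```

The first call is allowed because it matches the include pattern; the second is rejected because the exclude pattern takes priority even though the include pattern also matches.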
Because Cloudflare’s crawl API speaks the same */** wildcard language, glob patterns are forwarded directly to it so the filtering happens server-side. For matching that globs cannot express, pass a pre-compiled re.Pattern (matched with re.search) instead of a string:
import re
scope = CrawlScope(
    roots=["https://example.com/"],
    depth=2,
    include_patterns=[re.compile(r"/docs/v\d+/")],
)
Since the Cloudflare API cannot evaluate regexes, compiled patterns are not forwarded; CloudflareCrawler applies them locally to the records the crawl returns. You can mix glob strings and compiled regexes in the same list, and the glob portion is still pushed to the API for efficient server-side filtering.
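The split between forwarded globs and locally applied regexes can be pictured as a simple partition by type. The sketch below is only an illustration of that idea; split_patterns and matches_any_regex are hypothetical names, and how raghilda combines the two groups internally is not shown here.

```python
import re


def split_patterns(patterns: list) -> tuple[list, list]:
    """Separate glob strings from pre-compiled regexes."""
    globs = [p for p in patterns if isinstance(p, str)]
    regexes = [p for p in patterns if isinstance(p, re.Pattern)]
    return globs, regexes


def matches_any_regex(url: str, regexes: list) -> bool:
    """Local check using re.search, so a pattern may match anywhere in the URL."""
    return any(regex.search(url) for regex in regexes)


patterns = ["https://example.com/docs/**", re.compile(r"/docs/v\d+/")]
globs, regexes = split_patterns(patterns)
print(globs)  # candidates for server-side filtering
print(matches_any_regex("https://example.com/docs/v2/intro", regexes))
print(matches_any_regex("https://example.com/docs/latest/intro", regexes))
```

Because re.search is unanchored, the regex above does not need to spell out the scheme and host, unlike a glob matched against the full URL.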
Two additional scope fields are relevant for Cloudflare crawls:
- include_external_links=True allows the crawler to follow links to other domains.
- include_subdomains=True allows the crawler to follow links to subdomains of the root host (e.g., docs.example.com when crawling from example.com).
Both default to False, which keeps the crawl focused on the root domain.
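The distinction between a subdomain and an external domain can be sketched with urllib.parse. This is an illustration of the two flags as described above, not the crawler's actual logic; link_kind and should_follow are hypothetical names, and the sketch assumes the two flags act independently of each other.

```python
from urllib.parse import urlsplit


def link_kind(root_url: str, link_url: str) -> str:
    """Classify a link relative to the root host."""
    root_host = (urlsplit(root_url).hostname or "").lower()
    link_host = (urlsplit(link_url).hostname or "").lower()
    if link_host == root_host:
        return "same-host"
    # The leading dot prevents "notexample.com" from counting as a subdomain.
    if link_host.endswith("." + root_host):
        return "subdomain"
    return "external"


def should_follow(
    root_url: str,
    link_url: str,
    include_subdomains: bool = False,
    include_external_links: bool = False,
) -> bool:
    """Decide whether a link is in scope under the two flags (assumed semantics)."""
    kind = link_kind(root_url, link_url)
    if kind == "same-host":
        return True
    if kind == "subdomain":
        return include_subdomains
    return include_external_links


root = "https://example.com/"
print(link_kind(root, "https://docs.example.com/guide"))
print(should_follow(root, "https://docs.example.com/guide"))
print(should_follow(root, "https://docs.example.com/guide", include_subdomains=True))
```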
Caching
CloudflareCrawler accepts the same caching parameters as WebCrawler, though the underlying behavior differs slightly because results come from Cloudflare’s API rather than direct HTTP requests. Enable caching with cache_dir=True to store results under .raghilda/cache/cloudflare, or pass a custom path:
from datetime import timedelta
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    cache_dir=True,
    cache_stale_after=timedelta(days=1),
)
With caching enabled, the crawler stores both the crawl job results (list of discovered URLs and their rendered Markdown) and the individual page records on disk. On subsequent runs, fresh cached results are reused without making any Cloudflare API calls.
The cache_stale_after= parameter controls when cached results are considered stale. When a cached entry is stale, the crawler sends a new request to Cloudflare with a maxAge hint asking for updated content. When cache_stale_after is not set, cached entries never expire.
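The staleness rule amounts to a single comparison between an entry's age and the configured window. The sketch below illustrates it with the standard library; is_stale and its parameters are hypothetical and do not reflect raghilda's internal cache code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    fetched_at: datetime,
    stale_after: Optional[timedelta],
    now: Optional[datetime] = None,
) -> bool:
    """Return True when a cached entry is older than the staleness window."""
    if stale_after is None:
        return False  # no window configured: entries never expire
    now = now or datetime.now(timezone.utc)
    return now - fetched_at > stale_after


fetched = datetime(2024, 1, 1, tzinfo=timezone.utc)
later = fetched + timedelta(hours=25)
print(is_stale(fetched, timedelta(days=1), now=later))  # older than one day
print(is_stale(fetched, None, now=later))               # never expires
```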
To force a completely fresh crawl that bypasses the cache entirely, pass cache_force_refresh=True to markdown_documents():
# Ignore all cached results, re-crawl everything
documents = crawler.markdown_documents(scope, cache_force_refresh=True)
The cache validates its entries against a signature that includes the render=, source=, and modified_since= settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused.
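One common way to build such a signature is to hash a canonical serialization of the relevant settings. The sketch below shows the general idea only; cache_signature is a hypothetical name, and the actual fields, encoding, and hash used by raghilda may differ.

```python
import hashlib
import json
from typing import Optional


def cache_signature(render: bool, source: str, modified_since: Optional[int]) -> str:
    """Hash the settings that affect crawl output (illustrative only)."""
    payload = json.dumps(
        {"render": render, "source": source, "modified_since": modified_since},
        sort_keys=True,  # stable key order gives a stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = cache_signature(render=True, source="all", modified_since=None)
same = cache_signature(render=True, source="all", modified_since=None)
changed = cache_signature(render=False, source="all", modified_since=None)
print(first == same)     # identical settings reuse the cache
print(first == changed)  # a changed setting invalidates it
```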
Incremental updates with modified_since=
For stores that need regular updates, the modified_since= parameter restricts the crawl to pages modified after a given Unix timestamp. This reduces the number of pages Cloudflare processes on each run:
import time
# Only include pages modified in the last 7 days
one_week_ago = int(time.time()) - (7 * 24 * 60 * 60)
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    cache_dir=True,
    modified_since=one_week_ago,
)
Combined with store.ingest(), this gives you a lightweight refresh job where only recently changed pages are fetched and upserted (while unchanged documents are skipped by the store’s own deduplication logic).
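Because the cache signature described in the Caching section includes modified_since=, a cutoff computed from time.time() changes on every run, which by that description would invalidate the cache each time. If that reading is right, one possible mitigation is to snap the cutoff to a day boundary so that runs on the same day share one value. The helper below is a hypothetical sketch of that idea, not part of raghilda.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def modified_since_cutoff(days: int, now: Optional[datetime] = None) -> int:
    """Unix timestamp for UTC midnight, `days` days before today."""
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((midnight - timedelta(days=days)).timestamp())


morning = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
evening = datetime(2024, 1, 8, 21, 30, tzinfo=timezone.utc)
# Both runs on 8 January produce the same cutoff (1 January, 00:00 UTC).
print(modified_since_cutoff(7, now=morning) == modified_since_cutoff(7, now=evening))
```

The trade-off is a slightly wider window: pages modified up to one extra day earlier are included in the crawl.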
Polling behavior
Cloudflare processes crawl jobs asynchronously. After submitting a job, the crawler polls for completion at regular intervals. Two parameters control this behavior:
- poll_interval=5.0: seconds to wait between status checks. The default of 5 seconds is reasonable for most jobs.
- max_poll_attempts=60: maximum number of polls before raising a TimeoutError. With the default interval, this gives a 5-minute window.
For large sites that take longer to crawl, increase the timeout:
crawler = CloudflareCrawler(
    account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
    api_token=os.environ["CLOUDFLARE_API_TOKEN"],
    poll_interval=10.0,
    max_poll_attempts=120,
)
This configuration waits up to 20 minutes for a crawl job to finish.
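The relationship between the two parameters is easiest to see in a generic polling loop. The sketch below is not the crawler's implementation; wait_for_job, check_status, and the "completed" status string are hypothetical, and sleep is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def wait_for_job(
    check_status: Callable[[], str],
    poll_interval: float = 5.0,
    max_poll_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the job reports completion; return the number of polls used."""
    for attempt in range(1, max_poll_attempts + 1):
        if check_status() == "completed":  # hypothetical status value
            return attempt
        sleep(poll_interval)
    total = poll_interval * max_poll_attempts
    raise TimeoutError(
        f"job not finished after {max_poll_attempts} polls (~{total:.0f} s)"
    )


# Simulated job that completes on the third status check.
statuses = iter(["running", "running", "completed"])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```

The maximum wait is roughly poll_interval multiplied by max_poll_attempts: 5 s × 60 = 300 s with the defaults, and 10 s × 120 = 1200 s with the larger settings.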
Inspecting discovered pages
Before committing to a full ingest of a large site, you may want to spot-check what Cloudflare discovers and returns. CloudflareCrawler exposes lower-level methods for this kind of inspection. Use origins() to see what pages Cloudflare finds without converting them into documents:
for origin in crawler.origins(scope):
    print(origin)
Use fetch_raw() to retrieve the full FetchedSource for a single page, which includes metadata like the HTTP status code and content type:
source = crawler.fetch_raw("https://example.com/docs/getting-started")
print(f"Status: {source.status_code}")
print(f"Fetched at: {source.fetched_at}")
print(f"Body at: {source.body_path}")
Use fetch_markdown() to get a single MarkdownDocument:
doc = crawler.fetch_markdown("https://example.com/docs/getting-started")
print(doc.content[:500])
These methods are useful for debugging scope configuration or inspecting what Cloudflare returns before committing to a full ingest.
Full example
The following script builds a store from a JavaScript-rendered documentation site. It uses caching so that repeated runs during development avoid redundant API calls, and it sets a one-day staleness window for production refresh jobs.
import os
from datetime import timedelta
from pathlib import Path
from raghilda.chunker import MarkdownChunker
from raghilda.crawl import CloudflareCrawler, CrawlScope
from raghilda.embedding import EmbeddingOpenAI
from raghilda.store import DuckDBStore
DB_PATH = Path("spa_docs.db")
def build_store() -> DuckDBStore:
    store = DuckDBStore.create(
        location=str(DB_PATH),
        embed=EmbeddingOpenAI(),
        name="spa_docs",
        title="SPA Documentation",
        overwrite=True,
    )
    crawler = CloudflareCrawler(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        api_token=os.environ["CLOUDFLARE_API_TOKEN"],
        cache_dir=True,
        cache_stale_after=timedelta(days=1),
        render=True,
        source="all",
        max_workers=4,
    )
    scope = CrawlScope(
        roots=["https://my-spa-docs.example.com/"],
        depth=2,
        include_patterns=["https://my-spa-docs.example.com/**"],
        exclude_patterns=["https://my-spa-docs.example.com/internal/**"],
        limit=1000,
    )
    chunker = MarkdownChunker(chunk_size=1600, target_overlap=0.5)
    summary = store.ingest(
        crawler.markdown_documents(scope),
        prepare=chunker.chunk,
        max_workers=4,
    )
    store.build_index()
    print(f"Inserted: {summary.inserted}")
    print(f"Replaced: {summary.replaced}")
    print(f"Skipped: {summary.skipped}")
    return store
def refresh_store() -> None:
    store = DuckDBStore.connect(str(DB_PATH))
    crawler = CloudflareCrawler(
        account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"],
        api_token=os.environ["CLOUDFLARE_API_TOKEN"],
        cache_dir=True,
        cache_stale_after=timedelta(days=1),
        render=True,
        max_workers=4,
    )
    scope = CrawlScope(
        roots=["https://my-spa-docs.example.com/"],
        depth=2,
        include_patterns=["https://my-spa-docs.example.com/**"],
        limit=1000,
    )
    chunker = MarkdownChunker(chunk_size=1600, target_overlap=0.5)
    summary = store.ingest(
        crawler.markdown_documents(scope),
        prepare=chunker.chunk,
        max_workers=4,
    )
    print(f"Refresh complete: {summary.inserted} new, {summary.replaced} updated, {summary.skipped} unchanged")
if __name__ == "__main__":
    if DB_PATH.exists():
        refresh_store()
    else:
        build_store()
The script has two paths: an initial build that creates the store from scratch, and a refresh path that reconnects and upserts only changed content. The cache and the store’s own deduplication (skip_if_unchanged=True in upsert()) work together to keep refresh runs fast: the cache avoids re-fetching pages whose content has not changed, and the store avoids re-computing embeddings for documents that are identical to what is already stored.
When to use CloudflareCrawler
Choosing between CloudflareCrawler and WebCrawler comes down to whether the target site needs JavaScript to produce its content and whether you are willing to depend on an external service. Neither crawler is strictly better; they serve different situations.
Use CloudflareCrawler when:
- the target site renders content with JavaScript (React, Vue, Angular, etc.)
- you need Cloudflare’s sitemap-aware discovery rather than manual link following
- you want pre-converted Markdown without running local conversion logic
- the site is large enough that Cloudflare’s distributed infrastructure crawls it faster than concurrent local requests
Use WebCrawler when:
- the site is static HTML that does not require JavaScript execution
- you want to avoid external API dependencies and costs
- you need fine-grained control over HTTP headers, cookies, or authentication during fetching
- the crawl is small enough that local requests calls are fast enough
Both crawlers implement the same interface (origins(), fetch_raw(), fetch_markdown(), markdown_documents()), so switching between them requires changing only the constructor. The rest of your pipeline (chunking, ingestion, retrieval) remains unchanged.
Conclusion
CloudflareCrawler lets you build retrieval stores from sites that would otherwise be inaccessible to a simple HTTP client. The rendering, link discovery, and Markdown conversion all happen on Cloudflare’s infrastructure, so your local code stays focused on chunking and ingestion. Combined with caching, filtering patterns, and modified_since=, you can keep a store current without redundant API calls or full re-crawls. For sites that do not need JavaScript execution, WebCrawler remains the simpler and cheaper option, and the shared crawler interface means you can switch between the two without restructuring your pipeline.