crawl.CrawlScope

Declarative description of what a crawler should discover.

Usage

crawl.CrawlScope(
    roots,
    include_patterns=None,
    exclude_patterns=None,
    depth=None,
    limit=None,
    include_types=None,
    exclude_types=None,
    include_external_links=False,
    include_subdomains=False
)

A CrawlScope is the traversal policy shared by every crawler. It names the starting points and the rules used to decide which origins are followed and yielded. The same scope can be reused across DirectoryCrawler, WebCrawler, and CloudflareCrawler, though a few fields are interpreted slightly differently per backend.
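The distinction between origins that are followed and origins that are yielded can be sketched with a small breadth-first walk over an in-memory link graph. This is not the library's implementation: `walk` and the `links` mapping are hypothetical, and the breadth-first order and the choice to count roots against `limit` are assumptions. Only the meaning of `depth` (0 means roots only, None means unbounded) and the non-negativity checks come from the parameter descriptions.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Sequence


def walk(
    links: Dict[str, List[str]],
    roots: Sequence[str],
    depth: Optional[int] = None,
    limit: Optional[int] = None,
) -> Iterator[str]:
    """Hypothetical traversal: depth bounds what is followed, limit what is yielded."""
    for name, value in (("depth", depth), ("limit", limit)):
        if value is not None and value < 0:
            raise ValueError(f"{name} must be non-negative")

    unique_roots = list(dict.fromkeys(roots))  # drop duplicate roots, keep order
    queue = deque((root, 0) for root in unique_roots)
    seen = set(unique_roots)
    yielded = 0

    while queue:
        # Assumption: roots count toward the limit like any other origin.
        if limit is not None and yielded >= limit:
            return
        origin, level = queue.popleft()
        yield origin
        yielded += 1

        # depth=0 stops here for every root; depth=None never stops.
        if depth is not None and level >= depth:
            continue
        for child in links.get(origin, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))
```

With a graph where `a` links to `b` and `c`, and `b` links to `d`, `depth=0` yields only `a`, `depth=1` yields `a`, `b`, `c`, and `limit=2` stops after `a` and `b`.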

Parameters

roots: RootsInput: Starting files, directories, or URLs. May be a single value or a sequence of values.
include_patterns: PatternsInput = None: Patterns that an origin must match to be yielded. A str is treated as a glob: * matches any run of characters except /, and ** matches across / (a trailing /** also matches the bare parent, so /docs/** matches /docs too). Pass a compiled re.Pattern to match by regular expression instead. Accepts a single pattern or a sequence; when None, all origins are allowed.
exclude_patterns: PatternsInput = None: Patterns that drop an origin from the crawl, taking precedence over include_patterns. Uses the same glob-or-re.Pattern syntax.
depth: int | None = None: Number of link or directory levels to follow beyond the roots. 0 means only the roots themselves. When None, traversal is effectively unbounded. Must be non-negative.
limit: int | None = None: Maximum number of origins to yield. When None, no limit is applied. Must be non-negative.
include_types: Sequence[str] | None = None: Type labels to include, such as "html", "markdown", "pdf", "python", or "text". When None or empty, all types are allowed.
exclude_types: Sequence[str] | None = None: Type labels to skip. Takes precedence over include_types.
include_external_links: bool = False: Allow origins outside the root origin (a different scheme, host, or port).
include_subdomains: bool = False: Allow origins on subdomains of the root host.
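The glob rules for include_patterns and exclude_patterns can be illustrated with a minimal sketch. The helpers `glob_to_regex` and `is_allowed` are hypothetical and are not part of the crawl module. Several details are assumptions: that globs must match the whole string, that a compiled re.Pattern is applied with `search`, that an empty sequence behaves like None, and that the matched value is a path-like string. The `*`, `**`, trailing `/**`, and exclude-over-include behaviour follow the descriptions above.

```python
import re
from typing import Iterable, List, Optional, Union

PatternLike = Union[str, "re.Pattern[str]"]
PatternsArg = Optional[Union[PatternLike, Iterable[PatternLike]]]


def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Translate a glob into a regex following the documented rules."""
    tail = ""
    if glob.endswith("/**"):
        # A trailing "/**" also matches the bare parent: "/docs/**" matches "/docs".
        glob, tail = glob[:-3], "(?:/.*)?"
    parts: List[str] = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")       # "**" matches across "/"
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")    # "*" matches any run of characters except "/"
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts) + tail)


def _as_list(patterns: PatternsArg) -> List[PatternLike]:
    """Accept a single pattern or a sequence, as the parameters do."""
    if patterns is None:
        return []
    if isinstance(patterns, (str, re.Pattern)):
        return [patterns]
    return list(patterns)


def _matches(pattern: PatternLike, value: str) -> bool:
    if isinstance(pattern, str):
        # Assumption: a glob has to cover the entire value.
        return glob_to_regex(pattern).fullmatch(value) is not None
    # Assumption: a compiled regex is applied with search().
    return pattern.search(value) is not None


def is_allowed(
    value: str,
    include_patterns: PatternsArg = None,
    exclude_patterns: PatternsArg = None,
) -> bool:
    """Exclude patterns win; with no include patterns, everything else passes."""
    if any(_matches(p, value) for p in _as_list(exclude_patterns)):
        return False
    includes = _as_list(include_patterns)
    return not includes or any(_matches(p, value) for p in includes)
```

Under these assumptions, `is_allowed("/docs", include_patterns="/docs/**")` is true, while `is_allowed("/docs/a/b.html", include_patterns="/docs/*")` is false because `*` does not cross `/`.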
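The two origin flags can be sketched for web URLs with the standard library's `urllib.parse`. The function `in_scope` is hypothetical, and the sketch makes assumptions the descriptions above do not settle: default ports (80 for http, 443 for https) are filled in before comparing, and a subdomain must still share the root's scheme and port. How a real backend resolves these cases may differ.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return (scheme, host, port), filling in the default port when omitted."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or _DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def in_scope(
    url: str,
    root: str,
    include_external_links: bool = False,
    include_subdomains: bool = False,
) -> bool:
    """Hypothetical check of a URL against the origin of a single root."""
    if include_external_links:
        return True  # any scheme, host, or port is allowed

    scheme, host, port = _origin(url)
    root_scheme, root_host, root_port = _origin(root)

    # Same origin: scheme, host, and port all match the root.
    if (scheme, host, port) == (root_scheme, root_host, root_port):
        return True

    if include_subdomains and host.endswith("." + root_host):
        # Assumption: subdomains must keep the root's scheme and port.
        return (scheme, port) == (root_scheme, root_port)

    return False
```

The leading dot in the suffix check keeps a host such as `notexample.com` from being treated as a subdomain of `example.com`.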