crawl.CrawlScope

Declarative description of what a crawler should discover.

Usage

crawl.CrawlScope()

A CrawlScope is the traversal policy shared by every crawler. It names the starting points and the rules used to decide which origins are followed and yielded. The same scope can be reused across DirectoryCrawler, WebCrawler, and CloudflareCrawler, though a few fields are interpreted slightly differently per backend.

Parameter Attributes

roots: RootsInput

Starting files, directories, or URLs. May be a single value or a sequence of values.

include_patterns: PatternsInput = None

Patterns that an origin must match to be yielded. A str is treated as a glob: * matches any run of characters except /, and ** matches across / (a trailing /** also matches the bare parent, so /docs/** matches /docs too). Pass a compiled re.Pattern to match by regular expression instead. Accepts a single pattern or a sequence; when None, all origins are allowed.
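The glob rules above can be illustrated with a small standalone translation to a regular expression. This is only a sketch of the described semantics, not the library's implementation. The helper names glob_to_regex and matches are hypothetical, and matching the whole origin string (fullmatch) is an assumption, for globs and for compiled patterns alike.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: translate a glob into a compiled regex."""
    # A trailing "/**" also matches the bare parent, so that suffix is optional.
    if pattern.endswith("/**"):
        prefix, suffix = pattern[:-3], "(?:/.*)?"
    else:
        prefix, suffix = pattern, ""
    parts = []
    i = 0
    while i < len(prefix):
        if prefix.startswith("**", i):
            parts.append(".*")       # "**" may cross "/" boundaries
            i += 2
        elif prefix[i] == "*":
            parts.append("[^/]*")    # "*" stops at the next "/"
            i += 1
        else:
            parts.append(re.escape(prefix[i]))
            i += 1
    return re.compile("".join(parts) + suffix)

def matches(pattern, origin: str) -> bool:
    """Accept either a glob str or a compiled re.Pattern (assumed: full match)."""
    rx = pattern if isinstance(pattern, re.Pattern) else glob_to_regex(pattern)
    return rx.fullmatch(origin) is not None

print(matches("/docs/*", "/docs/a/b"))   # False: "*" does not cross "/"
print(matches("/docs/**", "/docs/a/b"))  # True
print(matches("/docs/**", "/docs"))      # True: bare parent
```

Whether the real matcher is applied to a full URL, a path, or a filesystem path is not stated here; the paths above are illustrative only.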

exclude_patterns: PatternsInput = None

Patterns that drop an origin from the crawl, taking precedence over include_patterns. Uses the same glob-or-re.Pattern syntax.
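As a rough illustration of the precedence rule, the sketch below checks exclusions first and only then consults the include list. The function name is_allowed is hypothetical, compiled re.Pattern objects are used to keep the example short, and full-string matching is an assumption.

```python
import re
from typing import Optional, Sequence

def is_allowed(
    origin: str,
    include: Optional[Sequence[re.Pattern]] = None,
    exclude: Optional[Sequence[re.Pattern]] = None,
) -> bool:
    """Hypothetical helper: exclude patterns win over include patterns."""
    # Any exclude match drops the origin, even if an include pattern also matches.
    if exclude and any(p.fullmatch(origin) for p in exclude):
        return False
    # include=None means every remaining origin is allowed.
    # (How an empty sequence is treated is not stated; here it allows nothing.)
    if include is None:
        return True
    return any(p.fullmatch(origin) for p in include)

include = [re.compile(r"/docs/.*")]
exclude = [re.compile(r".*/drafts/.*")]

print(is_allowed("/docs/guide", include, exclude))     # True
print(is_allowed("/docs/drafts/x", include, exclude))  # False: excluded
print(is_allowed("/blog/post", include, exclude))      # False: not included
```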

depth: int | None = None

Number of link or directory levels to follow beyond the roots. 0 means only the roots themselves. When None, traversal is effectively unbounded. Must be non-negative.

limit: int | None = None

Maximum number of origins to yield. When None, no limit is applied. Must be non-negative.
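How depth and limit interact can be sketched with a breadth-first walk over a toy link graph. This is an assumption-laden illustration rather than the crawlers' actual traversal: the names crawl_sketch and TOY_LINKS are hypothetical, and breadth-first order is chosen only for clarity.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional

# Hypothetical link graph: each key links to the listed children.
TOY_LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
}

def crawl_sketch(
    links: Dict[str, List[str]],
    roots: List[str],
    depth: Optional[int] = None,
    limit: Optional[int] = None,
) -> Iterator[str]:
    """Yield nodes breadth-first, honouring depth and limit as described above."""
    seen = set(roots)
    queue = deque((root, 0) for root in dict.fromkeys(roots))
    yielded = 0
    while queue:
        # limit caps how many origins are yielded in total.
        if limit is not None and yielded >= limit:
            return
        node, level = queue.popleft()
        yield node
        yielded += 1
        # depth=0 keeps only the roots; depth=None never stops descending.
        if depth is not None and level >= depth:
            continue
        for child in links.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))

print(list(crawl_sketch(TOY_LINKS, ["a"], depth=0)))  # ['a']
print(list(crawl_sketch(TOY_LINKS, ["a"], depth=1)))  # ['a', 'b', 'c']
print(list(crawl_sketch(TOY_LINKS, ["a"], limit=2)))  # ['a', 'b']
```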

include_types: Sequence[str] | None = None

Type labels to include, such as "html", "markdown", "pdf", "python", or "text". When None or empty, all types are allowed.

exclude_types: Sequence[str] | None = None

Type labels to skip. Takes precedence over include_types.
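The type filters follow the same precedence as the pattern filters. A minimal sketch, assuming type labels are compared as exact strings (the helper name type_allowed is hypothetical):

```python
from typing import Optional, Sequence

def type_allowed(
    label: str,
    include_types: Optional[Sequence[str]] = None,
    exclude_types: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper: exclude_types wins; empty or None include allows all."""
    if exclude_types and label in exclude_types:
        return False
    # None or an empty sequence means no restriction on type.
    if not include_types:
        return True
    return label in include_types

print(type_allowed("pdf"))                                          # True
print(type_allowed("pdf", include_types=["html", "markdown"]))      # False
print(type_allowed("html", include_types=["html"], exclude_types=["html"]))  # False
```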

include_external_links: bool = False

Allow origins whose scheme, host, or port differs from that of the roots. Defaults to False.

include_subdomains: bool = False

Allow origins on subdomains of the root host. Defaults to False.
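For URL roots, the two flags above can be pictured as an origin comparison built on the standard library's urllib.parse. This is a sketch under stated assumptions, not the library's logic: the name in_scope is hypothetical, default ports are normalised for http and https only, and a subdomain is assumed to still need the root's scheme and port.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def _origin(url: str):
    """Return (scheme, host, port), filling in the default port when omitted."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return scheme, host, port

def in_scope(
    url: str,
    root: str,
    include_external_links: bool = False,
    include_subdomains: bool = False,
) -> bool:
    """Hypothetical helper mirroring the two flags described above."""
    if include_external_links:
        return True
    scheme, host, port = _origin(url)
    r_scheme, r_host, r_port = _origin(root)
    if (scheme, port) != (r_scheme, r_port):
        return False  # assumption: scheme and port must match in every case
    if host == r_host:
        return True
    # "docs.example.com" counts as a subdomain of "example.com";
    # "notexample.com" does not, because the leading dot is required.
    return include_subdomains and host.endswith("." + r_host)

root = "https://example.com/"
print(in_scope("https://docs.example.com/a", root))                           # False
print(in_scope("https://docs.example.com/a", root, include_subdomains=True))  # True
print(in_scope("https://other.org/", root, include_external_links=True))      # True
```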