crawl.CrawlScope
Declarative description of what a crawler should discover.
Usage
crawl.CrawlScope()
A CrawlScope is the traversal policy shared by every crawler. It names the starting points and the rules used to decide which origins are followed and yielded. The same scope can be reused across DirectoryCrawler, WebCrawler, and CloudflareCrawler, though a few fields are interpreted slightly differently per backend.
Parameter Attributes
roots: RootsInput
Starting files, directories, or URLs. May be a single value or a sequence of values.
include_patterns: PatternsInput = None
Patterns that an origin must match to be yielded. A `str` is treated as a glob: `*` matches any run of characters except `/`, and `**` matches across `/` (a trailing `/**` also matches the bare parent, so `/docs/**` matches `/docs` too). Pass a compiled `re.Pattern` to match by regular expression instead. Accepts a single pattern or a sequence; when `None`, all origins are allowed.
exclude_patterns: PatternsInput = None
Patterns that drop an origin from the crawl, taking precedence over include_patterns. Uses the same glob-or-`re.Pattern` syntax.
depth: int | None = None
Number of link or directory levels to follow beyond the roots. `0` means only the roots themselves. When `None`, traversal is effectively unbounded. Must be non-negative.
limit: int | None = None
Maximum number of origins to yield. When `None`, no limit is applied. Must be non-negative.
include_types: Sequence[str] | None = None
Type labels to include, such as `"html"`, `"markdown"`, `"pdf"`, `"python"`, or `"text"`. When `None` or empty, all types are allowed.
exclude_types: Sequence[str] | None = None
Type labels to skip. Takes precedence over include_types.
include_external_links: bool = False
Allow origins outside the root origin (a different scheme, host, or port). Defaults to `False`.
include_subdomains: bool = False
Allow origins on subdomains of the root host. Defaults to `False`.
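
The glob rules described for include_patterns and exclude_patterns, and the precedence between the two, can be illustrated with a small standalone sketch. It is not the library's implementation: `glob_to_regex`, `matches`, and `is_allowed` are hypothetical helpers, and the sketch assumes that globs are matched against the whole origin string, that compiled regular expressions are applied with `search`, and that patterns arrive as sequences (the real fields also accept a single value).

```python
import re
from typing import Iterable, Optional, Pattern, Union

PatternLike = Union[str, Pattern[str]]


def glob_to_regex(glob: str) -> Pattern[str]:
    """Translate a glob into a regex (hypothetical helper, not library code)."""
    # A trailing "/**" also matches the bare parent, so make that tail optional.
    tail = ""
    if glob.endswith("/**"):
        glob = glob[:-3]
        tail = "(?:/.*)?"
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")       # "**" may cross "/" boundaries
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")    # "*" stops at the next "/"
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts) + tail)


def matches(pattern: PatternLike, origin: str) -> bool:
    """Match a glob string or a compiled regex against an origin."""
    if isinstance(pattern, str):
        # Assumption: a glob must cover the whole origin string.
        return glob_to_regex(pattern).fullmatch(origin) is not None
    # Assumption: compiled patterns are applied with search().
    return pattern.search(origin) is not None


def is_allowed(
    origin: str,
    include: Optional[Iterable[PatternLike]] = None,
    exclude: Optional[Iterable[PatternLike]] = None,
) -> bool:
    """Apply exclude patterns first, then include patterns."""
    if exclude is not None and any(matches(p, origin) for p in exclude):
        return False                 # exclusion takes precedence
    if include is None:
        return True                  # no include patterns: everything passes
    return any(matches(p, origin) for p in include)


print(is_allowed("/docs", include=["/docs/**"]))
print(is_allowed("/docs/sub/a.md", include=["/docs/*.md"]))
print(is_allowed("/docs/private/x", include=["/docs/**"], exclude=["/docs/private/**"]))
```

Under these assumptions, `/docs/**` accepts both `/docs` and anything beneath it, `/docs/*.md` rejects files in subdirectories, and an origin that matches both an include and an exclude pattern is dropped.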