scrape.find_links()

Discover hyperlinks starting from one or many documents and return them as URLs.

Usage

Source

scrape.find_links(
    x,
    depth=0,
    children_only=False,
    progress=True,
    *,
    url_filter=None,
    validate=False,
    **request_kwargs
)

Parameters

x: str | Path | Sequence[str | Path]

Starting URL(s). Accepts strings or paths; inputs must expand to HTTP(S) URLs.

depth: int = 0

Maximum traversal depth from each starting document. 0 inspects the starting pages only, 1 also inspects their direct children, and so on.
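The depth semantics above can be sketched as a breadth-first traversal. The helper below is hypothetical (not part of the library); `get_links(url)` stands in for whatever fetches and parses a single page:

```python
def find_links_sketch(starts, depth, get_links):
    # Breadth-first sketch of the depth parameter: depth=0 inspects only
    # the starting pages, depth=1 also inspects their direct children, etc.
    seen = set(starts)
    results = []          # discovery order, deduplicated
    frontier = list(starts)
    for _ in range(depth + 1):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    results.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return results
```

With a toy link graph `{"a": ["b", "c"], "b": ["d"]}`, `depth=0` from `"a"` yields `["b", "c"]`, while `depth=1` also visits those children and yields `["b", "c", "d"]`.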

children_only: bool = False

When True, only links that stay under the originating host are returned and traversed.
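The "stays under the originating host" test can be approximated with the standard library; this is a sketch of the idea, not the library's internal check:

```python
from urllib.parse import urlsplit

def same_host(origin, link):
    # children_only keeps links whose host matches the starting URL's host
    return urlsplit(origin).hostname == urlsplit(link).hostname
```

For example, `same_host("https://example.com/a", "https://example.com/b")` is true, while a link to `https://other.com/` would be dropped.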

progress: bool = True

Whether to display a progress bar while traversing links. Falls back to a no-op when tqdm is not available.

url_filter: Callable[[set[str]], list[str]] | None = None

Receives the set of discovered URLs and returns the list of URLs to keep, which may be smaller than the input.
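A filter matching the declared signature `Callable[[set[str]], list[str]]` might look like the following (the predicate here is illustrative only):

```python
def keep_html_only(urls):
    # url_filter receives the set of candidate URLs and returns the
    # (possibly smaller) list to keep; sorting fixes a stable output order.
    return sorted(u for u in urls if u.endswith((".html", "/")))
```

Pass it as `scrape.find_links(start, url_filter=keep_html_only)` to restrict traversal to the kept URLs.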

validate: bool = False

When True, perform a lightweight validation to ensure targets are reachable before including them in the results.

request_kwargs: Any = {}

Additional keyword arguments forwarded to requests.Session.get() (and requests.Session.head() during validation) when fetching HTTP resources.

Returns

Iterator[str]

Yields absolute link targets, deduplicated and ordered as discovered.
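The "deduplicated and ordered as discovered" contract can be expressed as a small generator; this is a sketch of the return behaviour, not the library's implementation:

```python
def dedupe_in_order(urls):
    # Yield each URL once, in first-seen order, matching the
    # Iterator[str] contract described above.
    seen = set()
    for u in urls:
        if u not in seen:
            seen.add(u)
            yield u
```

For instance, the discovery sequence `["a", "b", "a", "c"]` is yielded as `"a"`, `"b"`, `"c"`.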