scrape.find_links()

Discover hyperlinks starting from one or many documents and return them as URLs.

Usage

Source

scrape.find_links(
    x,
    depth=0,
    children_only=False,
    progress=True,
    *,
    url_filter=None,
    validate=False,
    **request_kwargs
)

Parameters

x: str | Path | Sequence[str | Path]

Starting URL(s). Accepts strings or paths; inputs must expand to HTTP(S) URLs.

depth: int = 0

Maximum traversal depth from each starting document. 0 inspects the starting pages only, 1 also inspects their direct children, and so on.
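The depth semantics above can be sketched as a breadth-first traversal. The helper below is hypothetical (not part of the library); `get_links(url)` stands in for whatever fetches and parses a single page:

```python
def find_links_sketch(starts, depth, get_links):
    # Breadth-first sketch of the depth parameter: depth=0 inspects only
    # the starting pages, depth=1 also inspects their direct children, etc.
    seen = set(starts)
    results = []          # discovery order, deduplicated
    frontier = list(starts)
    for _ in range(depth + 1):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    results.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return results
```

With a toy link graph `{"a": ["b", "c"], "b": ["d"]}`, `depth=0` from `"a"` yields `["b", "c"]`, while `depth=1` also visits those children and yields `["b", "c", "d"]`.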

children_only: bool = False

When True, only links that stay under the originating host are returned and traversed.
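The "stays under the originating host" test can be approximated with the standard library; this is a sketch of the idea, not the library's internal check:

```python
from urllib.parse import urlsplit

def same_host(origin, link):
    # children_only keeps links whose host matches the starting URL's host
    return urlsplit(origin).hostname == urlsplit(link).hostname
```

For example, `same_host("https://example.com/a", "https://example.com/b")` is true, while a link to `https://other.com/` would be dropped.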

progress: bool = True

Whether to display a progress bar while traversing links. Falls back to a no-op when tqdm is not available.

url_filter: Callable[[set[str]], list[str]] | None = None

Receives the set of discovered URLs and returns the list of URLs to keep, which may be smaller than the input.
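A filter matching the declared signature `Callable[[set[str]], list[str]]` might look like the following (the predicate here is illustrative only):

```python
def keep_html_only(urls):
    # url_filter receives the set of candidate URLs and returns the
    # (possibly smaller) list to keep; sorting fixes a stable output order.
    return sorted(u for u in urls if u.endswith((".html", "/")))
```

Pass it as `scrape.find_links(start, url_filter=keep_html_only)` to restrict traversal to the kept URLs.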

validate: bool = False

When True, perform a lightweight validation to ensure targets are reachable before including them in the results.

request_kwargs: Any = {}

Additional keyword arguments forwarded to requests.Session.get() (and requests.Session.head() during validation) when fetching HTTP resources.

Returns

Iterator[str]

Yields absolute link targets, deduplicated and ordered as discovered.
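The "deduplicated and ordered as discovered" contract can be expressed as a small generator; this is a sketch of the return behaviour, not the library's implementation:

```python
def dedupe_in_order(urls):
    # Yield each URL once, in first-seen order, matching the
    # Iterator[str] contract described above.
    seen = set()
    for u in urls:
        if u not in seen:
            seen.add(u)
            yield u
```

For instance, the discovery sequence `["a", "b", "a", "c"]` is yielded as `"a"`, `"b"`, `"c"`.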