----------------------------------------------------------------------
This is the API documentation for the raghilda library.
----------------------------------------------------------------------

## Store

Vector storage backends for storing and retrieving chunks

BaseStore()

Abstract base class for vector stores.

A store is responsible for storing documents and their embeddings, and retrieving relevant chunks based on similarity search.

Subclasses must implement all abstract methods to provide a concrete storage backend:

- :py:class:`raghilda.store.DuckDBStore`: local storage with embedding and BM25 search.
- :py:class:`raghilda.store.ChromaDBStore`: local storage using ChromaDB.
- :py:class:`raghilda.store.OpenAIStore`: hosted storage using OpenAI's Vector Store API.

DuckDBStore(con: _duckdb.DuckDBPyConnection, metadata: raghilda._store_metadata.EmbeddedAttributesStoreMetadata)

A vector store backed by DuckDB.

DuckDBStore provides local vector storage with support for both semantic search (using embeddings) and full-text search (using BM25). Data is persisted to a DuckDB database file.

Examples
--------

```{python}
#| eval: false
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

# Create a new store with embeddings
store = DuckDBStore.create(
    location="my_store.db",
    embed=EmbeddingOpenAI(),
)

# Insert a chunked document
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nThis is a sample document.",
)
store.upsert(MarkdownChunker().chunk(doc))

# Retrieve similar chunks
chunks = store.retrieve("How do I use this?", top_k=5)
```

ChromaDBStore(client: 'Any', collection: 'Any', metadata: 'AttributesStoreMetadata')

A vector store backed by ChromaDB.

ChromaDBStore provides local vector storage using Chroma's embedded client.
Documents are chunked by raghilda and embeddings are generated by Chroma's embedding function (defaults to Chroma's built-in embedding).

Examples
--------

```{python}
#| eval: false
from raghilda.store import ChromaDBStore

store = ChromaDBStore.create(location="raghilda_chroma", name="docs")
store.upsert(markdown_doc)
chunks = store.retrieve("hello world", top_k=3)
```

OpenAIStore(client: Any, store_id: str, *, attributes_spec: Optional[Mapping[str, raghilda._attribute_schema.AttributeSpec]] = None, attributes: Optional[Mapping[str, type[str] | type[int] | type[float] | type[bool] | raghilda._attribute_schema.AttributeFloatVectorType | raghilda._attribute_schema.AttributeStructType]] = None)

A vector store backed by OpenAI's Vector Store API.

OpenAIStore uses OpenAI's hosted vector storage service for document storage and retrieval. Documents are uploaded as files and automatically chunked and embedded by OpenAI.

Examples
--------

```{python}
#| eval: false
from raghilda.store import OpenAIStore

# Create a new store
store = OpenAIStore.create(name="my-store")

# Or connect to an existing store
store = OpenAIStore.connect(store_id="vs_abc123")

# Insert documents
from raghilda.document import MarkdownDocument

doc = MarkdownDocument(content="# Hello\nWorld", origin="example.md")
store.upsert(doc)

# Retrieve similar chunks
chunks = store.retrieve("greeting", top_k=5)
```

PostgreSQLStore(con: psycopg2.extensions.connection, metadata: dict, schema: str)

A store backed by a PostgreSQL database with pgvector.

Uses PostgreSQL for storage with two retrieval methods:

- **Full-text search** via :meth:`retrieve_fts`: uses PostgreSQL's built-in ``tsvector``/``tsquery`` with ``ts_rank`` for ranking. A pre-computed ``tsvector`` column with a GIN index is created automatically.
- **Vector similarity search** via :meth:`retrieve_vss`: uses pgvector for nearest-neighbor search over embeddings.
An HNSW index for cosine distance is created automatically when an embedding provider is given. Use :meth:`build_index` to add indexes for other distance methods (L2, inner product).

## Embedding

Embedding providers for generating vector representations

EmbeddingProvider()

Interface for embedding function providers.

To create a custom embedding provider:

1. Subclass `EmbeddingProvider` and implement `embed()`, `get_config()`, and `from_config()`
2. Register it with `@register_embedding_provider("MyProvider")`

Registered providers are automatically restored when connecting to a database that was created with that provider.

Examples
--------

```{python}
#| eval: false
from raghilda.embedding import EmbeddingProvider, register_embedding_provider

@register_embedding_provider("MyCustomEmbedding")
class MyCustomEmbedding(EmbeddingProvider):
    def __init__(self, model: str = "default", api_key: str | None = None):
        self.model = model
        self.api_key = api_key
        # Initialize your embedding client here

    def embed(self, x, input_type=None):
        # Return list of embedding vectors
        ...

    def get_config(self):
        # Return config dict (exclude sensitive values like api_key)
        return {"type": "MyCustomEmbedding", "model": self.model}

    @classmethod
    def from_config(cls, config):
        return cls(model=config.get("model", "default"))
```

EmbedInputType(*values)

Specifies the type of input being embedded.

Some embedding models (e.g., Cohere) produce different embeddings for queries vs documents to optimize retrieval performance.

EmbeddingOpenAI(model: str = 'text-embedding-3-small', base_url: str = 'https://api.openai.com/v1', api_key: Optional[str] = None, batch_size: int = 20) -> None

Creates an embedding function provider backed by OpenAI's embedding models.

Implements the [EmbeddingProvider](`raghilda.EmbeddingProvider`) interface.

Parameters
----------
model
    The OpenAI embedding model to use. Default is "text-embedding-3-small".
base_url
    The base URL for the OpenAI API.
    Default is "https://api.openai.com/v1".
api_key
    The API key for authenticating with OpenAI. If None, it will use the OPENAI_API_KEY environment variable if set.
batch_size
    The number of texts to process in each batch when calling the API.

Examples
--------

```{python}
#| eval: false
from raghilda.embedding import EmbeddingOpenAI

provider = EmbeddingOpenAI(model="text-embedding-3-small")
embeddings = provider.embed(["hello world", "testing embeddings"])

print(len(embeddings))
print(len(embeddings[0]))   # Dimension of the embedding
print(embeddings[0][:10])   # First 10 values of the embedding vector
```

EmbeddingCohere(model: str = 'embed-english-v3.0', api_key: Optional[str] = None, batch_size: int = 96) -> None

Creates an embedding function provider backed by Cohere's embedding models.

Implements the [EmbeddingProvider](`raghilda.EmbeddingProvider`) interface.

Cohere's embedding models produce different embeddings for queries vs documents to optimize retrieval performance. Use `input_type=EmbedInputType.QUERY` when embedding search queries and `input_type=EmbedInputType.DOCUMENT` (default) when embedding documents for indexing.

Parameters
----------
model
    The Cohere embedding model to use. Default is "embed-english-v3.0".
api_key
    The API key for authenticating with Cohere. If None, it will use the CO_API_KEY environment variable if set.
batch_size
    The number of texts to process in each batch when calling the API. Cohere supports up to 96 texts per request.
Examples
--------

```{python}
#| eval: false
from raghilda.embedding import EmbeddingCohere, EmbedInputType

provider = EmbeddingCohere(model="embed-english-v3.0")

# Embed documents for indexing
doc_embeddings = provider.embed(
    ["Hello world", "Testing embeddings"],
    input_type=EmbedInputType.DOCUMENT,
)

# Embed a query for search
query_embedding = provider.embed(
    ["How do I test embeddings?"],
    input_type=EmbedInputType.QUERY,
)
```

EmbeddingSentenceTransformers(model: str = 'all-MiniLM-L6-v2', device: Optional[str] = None, batch_size: int = 64, prompts: Optional[dict[raghilda._embedding.EmbedInputType, str]] = None) -> None

Creates an embedding function provider backed by sentence-transformers models.

Implements the [EmbeddingProvider](`raghilda.EmbeddingProvider`) interface.

This provider runs models locally using the `sentence-transformers` library, enabling offline/private embedding without external API calls.

Parameters
----------
model
    The sentence-transformers model to use. Default is "all-MiniLM-L6-v2". Any model from the Hugging Face Hub that is compatible with sentence-transformers can be used.
device
    The device to run the model on (e.g., "cpu", "cuda", "mps"). If None, sentence-transformers will auto-detect the best available device.
batch_size
    The number of texts to process in each batch.
prompts
    Optional mapping from `EmbedInputType` to a prefix string to prepend to each text before encoding. This is useful for models that require task-specific prefixes (e.g., nomic-embed-text uses "search_query: " and "search_document: ").
Examples
--------

Install raghilda with sentence-transformers support:

```bash
pip install raghilda[sentence-transformers]
```

```{python}
#| eval: false
from raghilda.embedding import EmbeddingSentenceTransformers

provider = EmbeddingSentenceTransformers(model="all-MiniLM-L6-v2")
embeddings = provider.embed(["hello world", "testing embeddings"])

print(len(embeddings))
print(len(embeddings[0]))  # Dimension of the embedding
```

For models that use task-specific prefixes:

```{python}
#| eval: false
from raghilda.embedding import EmbeddingSentenceTransformers, EmbedInputType

provider = EmbeddingSentenceTransformers(
    model="nomic-ai/nomic-embed-text-v1.5",
    prompts={
        EmbedInputType.QUERY: "search_query: ",
        EmbedInputType.DOCUMENT: "search_document: ",
    },
)

# Queries get "search_query: " prepended automatically
query_emb = provider.embed(["Who is Laurens van der Maaten?"], EmbedInputType.QUERY)

# Documents get "search_document: " prepended automatically
doc_emb = provider.embed(["TSNE is a dimensionality reduction algorithm"])
```

## Chunker

Text chunking utilities for splitting documents

BaseChunker()

Base class for chunkers.

A chunker splits a :py:class:`raghilda.document.Document` into a :py:class:`raghilda.document.ChunkedDocument` containing smaller text segments suitable for embedding and retrieval.

Subclasses must implement :py:meth:`chunk` and :py:meth:`chunk_text` to provide a concrete chunking strategy:

- :py:class:`raghilda.chunker.MarkdownChunker`: splits Markdown documents at semantic boundaries (headings, paragraphs, sentences).

MarkdownChunker(chunk_size: int = 1600, target_overlap: float = 0.5, *, max_snap_distance: int = 20, segment_by_heading_levels: Optional[list[int]] = None) -> None

Chunk Markdown documents into overlapping segments at semantic boundaries.

This chunker divides Markdown text into smaller, overlapping chunks while intelligently positioning cut points at semantic boundaries like headings, paragraphs, sentences, and words.
Rather than cutting rigidly at character counts, it nudges cut points to the nearest sensible boundary, producing more semantically coherent chunks suitable for RAG applications.

Parameters
----------
chunk_size
    Target size for each chunk in characters. The chunker attempts to create chunks near this size, though actual sizes may vary based on semantic boundaries. Default is 1600 characters.
target_overlap
    Fraction of overlap between successive chunks, from 0 to 1. Default is 0.5 (50% overlap). Even with 0, some overlap may occur because the last chunk is anchored to the document end.
max_snap_distance
    Maximum distance (in characters) to move a cut point to reach a semantic boundary. If no boundary is found within this distance, the cut point stays at its original position. Default is 20.
segment_by_heading_levels
    List of heading levels (1-6) that act as hard boundaries. When specified, no chunk will cross these headings, and segments between them are chunked independently. For example, `[1, 2]` ensures chunks never span across h1 or h2 headings.

Examples
--------

```{python}
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker(
    chunk_size=100,
    target_overlap=0.2,
    segment_by_heading_levels=[1, 2],
)

text = '''# Introduction

This is the introduction section with some content.

## Background

Here is background information that provides context.

## Methods

The methods section describes our approach.
'''

chunks = chunker.chunk_text(text)
for chunk in chunks:
    print(f"[{chunk.start_index}:{chunk.end_index}] {chunk.text[:40]}...")
```

Notes
-----

The chunking algorithm works as follows:

1. Parse the Markdown to identify semantic boundaries (headings, paragraphs, sentences, lines, words)
2. If `segment_by_heading_levels` is set, split the document at those headings first
3. For each segment, calculate target chunk boundaries based on `chunk_size` and `target_overlap`
4. Snap each boundary to the nearest semantic boundary (preferring headings > paragraphs > sentences > lines > words)
5. Extract chunks with their positional information and heading context

## Utilities

Utility functions for reading and scraping content

read_as_markdown(uri: str, html_extract_selectors: Optional[list[str]] = None, html_zap_selectors: Optional[list[str]] = None, *args, **kwargs) -> raghilda.document.MarkdownDocument

Read a markdown file from a URI and return its content as a MarkdownDocument.

Parameters
----------
uri
    The URI of the markdown file to read. Supported schemes are:

    - path/to/file.md
    - http://example.com/file.md
    - https://example.com/file.md
html_extract_selectors
    A list of CSS selectors to extract specific parts of the HTML content when the URI points to an HTML page. Defaults to ['main'].
html_zap_selectors
    A list of CSS selectors to remove specific parts of the HTML content when the URI points to an HTML page. Defaults to ['nav'].

Returns
-------
MarkdownDocument
    The content of the markdown file as a MarkdownDocument object.

Examples
--------

```{python}
#| eval: false
from raghilda.read import read_as_markdown

# Read from a local file
md_content = read_as_markdown("path/to/file.md")
print(md_content)

# Read from an HTTP URL
md_content = read_as_markdown("https://raw.githubusercontent.com/user/repo/branch/file.md")
print(md_content)
```

find_links(x: 'str | Path | Sequence[str | Path]', depth: 'int' = 0, children_only: 'bool' = False, progress: 'bool' = True, *, url_filter: 'Callable[[set[str]], list[str]] | None' = None, validate: 'bool' = False, **request_kwargs: 'Any') -> 'list[str]'

Discover hyperlinks starting from one or many documents and return them as URLs.

Parameters
----------
x
    Starting URL(s). Accepts strings or paths; inputs must expand to HTTP(S) URLs.
depth
    Maximum traversal depth from each starting document. ``0`` inspects the starting pages only, ``1`` also inspects their direct children, and so on.
children_only
    When ``True``, only links that stay under the originating host are returned and traversed.
progress
    Whether to display a progress bar while traversing links. Falls back to a no-op when :mod:`tqdm` is not available.
url_filter
    Receives the set of discovered URLs and returns the list of URLs to keep, which may be smaller.
validate
    When ``True``, perform a lightweight validation to ensure targets are reachable before including them in the results.
request_kwargs
    Additional keyword arguments forwarded to :func:`requests.Session.get` (and ``head`` during validation) when fetching HTTP resources.

Returns
-------
list[str]
    Absolute link targets, deduplicated and ordered as discovered.

## Chunk

Chunk data types

Chunk(text: str, start_index: int, end_index: int, char_count: int, context: Optional[str] = None, origin: Optional[str] = None, attributes: Optional[dict[str, Any]] = None) -> None

A segment of text extracted from a document.

Chunks are the fundamental unit for retrieval in RAG applications. Each chunk contains the text content along with positional information that allows mapping back to the original document.

Attributes
----------
text
    The actual text content of the chunk.
start_index
    Character position where this chunk begins in the source document.
end_index
    Character position where this chunk ends in the source document.
char_count
    Number of characters in this chunk.
context
    Optional heading context showing the document hierarchy at this chunk's position (e.g., the Markdown headings that apply).
origin
    Origin of the parent document this chunk belongs to.
attributes
    Optional user-defined attributes associated with the chunk. These attributes can be used for retrieval filtering/scoping and downstream prompt/context construction.
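The positional fields above are enough to map a chunk back to its source document. A minimal sketch using a stand-in dataclass with the same fields (illustration only, not the real `raghilda.chunk.Chunk`, which may carry extra behavior):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in mirroring the documented Chunk fields (illustration only)
@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    char_count: int
    context: Optional[str] = None
    origin: Optional[str] = None

document = "# Intro\n\nChunks map back to their source by character position."

chunk = Chunk(
    text=document[9:],              # everything after "# Intro\n\n"
    start_index=9,
    end_index=len(document),
    char_count=len(document) - 9,
    context="# Intro",
    origin="example.md",
)

# The positional fields recover the chunk text from the original document
assert document[chunk.start_index:chunk.end_index] == chunk.text
assert chunk.char_count == chunk.end_index - chunk.start_index
```

This slicing invariant is what lets retrieval results be traced back to (and highlighted in) the original document.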
MarkdownChunk(text: str, start_index: int, end_index: int, char_count: int, context: Optional[str] = None, origin: Optional[str] = None, attributes: Optional[dict[str, Any]] = None) -> None

A chunk extracted from a Markdown document.

MarkdownChunk extends Chunk for use with Markdown content. It typically preserves heading context from the source document, allowing retrieval results to show where in the document hierarchy each chunk originated.

RetrievedChunk(text: str, start_index: int, end_index: int, char_count: int, context: Optional[str] = None, origin: Optional[str] = None, attributes: Optional[dict[str, Any]] = None, metrics: list[raghilda.chunk.Metric] = <factory>, chunk_ids: list[int] = <factory>) -> None

A chunk returned from a retrieval operation with associated metrics.

RetrievedChunk extends Chunk with retrieval metrics that indicate how well the chunk matched the query. Common metrics include similarity scores and BM25 scores.

Attributes
----------
metrics
    List of Metric objects containing retrieval scores.
chunk_ids
    Backend chunk identifiers represented by this retrieved chunk. For non-deoverlapped results this usually contains one id. For deoverlapped chunks it may include multiple source chunk ids.

Examples
--------

```{python}
from raghilda.chunk import RetrievedChunk, Metric

chunk = RetrievedChunk(
    text="This is relevant content.",
    start_index=0,
    end_index=25,
    char_count=25,
    metrics=[
        Metric(name="similarity", value=0.92),
        Metric(name="bm25_score", value=15.3),
    ],
)

for metric in chunk.metrics:
    print(f"{metric.name}: {metric.value}")
```

Metric(name: str, value: float) -> None

A named metric value associated with a retrieved chunk.

Metrics are used to store retrieval scores and other measurements that describe how well a chunk matches a query.

Attributes
----------
name
    The name of the metric (e.g., "similarity", "bm25_score").
value
    The numeric value of the metric.
Examples
--------

```{python}
from raghilda.chunk import Metric

similarity = Metric(name="similarity", value=0.95)
print(f"{similarity.name}: {similarity.value}")
```

## Document

Document types for unchunked and chunked content

Document(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None) -> None

A document containing text content to be chunked and indexed.

Documents are the primary input for RAG stores. Each document has text content and an optional origin identifier.

Attributes
----------
content
    The full text content of the document.
origin
    Unique origin identifier for the document. This can be None or an empty string while preparing a document object, but stores require a populated origin for upsert operations.
attributes
    Optional user-defined attributes applied at document insertion time. Document-level attributes can be inherited by chunks and returned during retrieval for filtering and downstream prompt/context use.

ChunkedDocument(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None, *, chunks: 'list[Chunk]') -> None

A document with an attached sequence of chunks.

This is the explicit chunked variant of `Document`, used by stores and chunkers that operate on pre-segmented content.

MarkdownDocument(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None) -> None

A Markdown document with source tracking.

MarkdownDocument extends Document with markdown-specific semantics for content that comes from a source origin (e.g., URL or file path). This is useful for citation and provenance tracking in RAG applications.
Examples
--------

```{python}
from raghilda.document import MarkdownDocument

# Create from content directly
doc = MarkdownDocument(
    content="# Hello World\n\nThis is a test document.",
    origin="https://example.com/hello.md",
)

print(f"Document from: {doc.origin}")
print(f"Content length: {len(doc.content)} characters")
```

ChunkedMarkdownDocument(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None, *, chunks: 'list[Chunk]') -> None

A Markdown document with an attached sequence of chunks.

## Types

Protocol types for type checking compatibility

ChunkLike(*args, **kwargs)

Any chunk-like object (chonkie, raghilda, or custom).

ChunkedDocumentLike(*args, **kwargs)

Any chunked document-like object.

DocumentLike(*args, **kwargs)

Any document-like object.

ChunkerLike(*args, **kwargs)

Any chunker-like object (chonkie, raghilda, or custom).

IntoChunk(*args, **kwargs)

Any object that can be converted into a Chunk via to_chunk().

IntoDocument(*args, **kwargs)

Any object that can be converted into a Document via to_document().

----------------------------------------------------------------------
This is the User Guide documentation for the package.
----------------------------------------------------------------------

## Getting Started

### Core Concepts

Large language models (LLMs) sometimes generate confident but incorrect information, a phenomenon known as hallucination. This happens because LLMs work by predicting the most likely next words based on patterns learned during training, without any inherent concept of truth or factual accuracy.

## Why RAG?

Retrieval-Augmented Generation (RAG) addresses this by grounding LLM responses in trusted source material. Instead of relying solely on the model's training data, RAG retrieves relevant content from a curated knowledge base and includes it in the prompt. This shifts the model's role from open-ended generation to summarizing vetted content.
While RAG doesn't eliminate hallucinations entirely, it significantly reduces them for domain-specific applications by ensuring responses are anchored in authoritative sources.

## Building a RAG System

A RAG system has two main phases:

1. **Preparation**: Building a searchable knowledge store from your documents
2. **Retrieval**: Finding relevant content to augment LLM prompts

Let's walk through building a RAG system using the [Quarto documentation](https://quarto.org/docs/guide/) as our knowledge base.

## Creating a Store

First, create a store with an embedding provider. The store will hold your document chunks and their vector embeddings:

```{python}
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(
    location="quarto_docs.db",
    embed=EmbeddingOpenAI(),
    name="quarto",
    title="Quarto Documentation",
    overwrite=True,
)
```

raghilda supports multiple embedding providers (OpenAI, Cohere, sentence-transformers) and storage backends (DuckDB, ChromaDB, OpenAI Vector Stores, PostgreSQL). See the [API Reference](/reference/index.qmd) for all options.

## Finding Documents

Next, identify the documents to include. The `find_links()` function can crawl a website to discover pages:

```{python}
from raghilda.scrape import find_links

links = find_links(
    "https://quarto.org/docs/guide/",
    depth=1,  # follow links 1 level deep from the starting page
    children_only=True,
)
print(f"Found {len(links)} pages")
```

The `depth` parameter controls how many levels of links to follow, and `children_only=True` restricts crawling to pages under the starting URL.
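For finer control over which discovered links are kept, `find_links()` also accepts a `url_filter` callable that receives the discovered URLs and returns the (possibly smaller) list to keep. The filter itself is plain Python, so it can be written and tested without raghilda; a sketch (the kept suffixes here are an assumption for illustration, not a library default):

```python
def docs_pages_only(urls):
    """Keep only links that look like renderable documentation pages.

    Matches the url_filter contract: receives discovered URLs,
    returns the list of URLs to keep.
    """
    keep_suffixes = (".html", ".md", "/")  # assumed suffixes for this example
    return sorted(u for u in urls if u.endswith(keep_suffixes))

discovered = {
    "https://quarto.org/docs/guide/",
    "https://quarto.org/docs/guide/basics.html",
    "https://quarto.org/docs/download/archive.zip",
}
kept = docs_pages_only(discovered)
print(kept)  # the .zip archive is dropped
```

Pass it as `find_links(..., url_filter=docs_pages_only)` to prune pages before they are fetched and traversed.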
You can also work with local files or provide a list of URLs directly:

```{python}
#| eval: false
# Local files
links = ["docs/guide.md", "docs/reference.md", "docs/tutorial.md"]

# Or use glob patterns with pathlib
from pathlib import Path
links = list(Path("docs").glob("**/*.md"))
```

## Preparing Documents

Prepare each document explicitly by reading it, chunking it, and passing the result to `upsert()`:

```{python}
from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker()

for link in links:
    document = read_as_markdown(link)
    chunked = chunker.chunk(document)
    store.upsert(chunked)

print(f"Indexed {store.size()} documents")
```

That is the full preparation phase. Each document is converted to Markdown, split into overlapping chunks, embedded, and written to the store through explicit calls that keep the indexing pipeline visible.

## What Happens During Preparation

Each item you index typically goes through two steps before it is stored:

**1. Convert to Markdown** — `read_as_markdown()` converts the item (a URL or file path) into a Markdown document. It handles HTML pages, PDFs, DOCX files, and more using [MarkItDown](https://github.com/microsoft/markitdown). For HTML, it extracts the `main` element and removes `nav` elements by default.