----------------------------------------------------------------------
This is the API documentation for the raghilda library.
----------------------------------------------------------------------

## Store

Vector storage backends for storing and retrieving chunks

BaseStore()

Abstract base class for vector stores.

A store is responsible for storing documents and their embeddings, and retrieving relevant chunks based on similarity search.

Subclasses must implement all abstract methods to provide a concrete storage backend:

- :py:class:`raghilda.store.DuckDBStore`: local storage with embedding and BM25 search.
- :py:class:`raghilda.store.ChromaDBStore`: local storage using ChromaDB.
- :py:class:`raghilda.store.OpenAIStore`: hosted storage using OpenAI's Vector Store API.

DuckDBStore(con: _duckdb.DuckDBPyConnection, metadata: raghilda._store_metadata.EmbeddedAttributesStoreMetadata)

A vector store backed by DuckDB.

DuckDBStore provides local vector storage with support for both semantic search (using embeddings) and full-text search (using BM25). Data is persisted to a DuckDB database file.

Examples
--------

```{python}
#| eval: false
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

# Create a new store with embeddings
store = DuckDBStore.create(
    location="my_store.db",
    embed=EmbeddingOpenAI(),
)

# Insert a chunked document
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nThis is a sample document.",
)
store.upsert(MarkdownChunker().chunk(doc))

# Retrieve similar chunks
chunks = store.retrieve("How do I use this?", top_k=5)
```

ChromaDBStore(client: 'Any', collection: 'Any', metadata: 'AttributesStoreMetadata')

A vector store backed by ChromaDB.

ChromaDBStore provides local vector storage using Chroma's embedded client.
Documents are chunked by raghilda and embeddings are generated by Chroma's embedding function (defaults to Chroma's built-in embedding).

Examples
--------

```{python}
#| eval: false
from raghilda.store import ChromaDBStore

store = ChromaDBStore.create(location="raghilda_chroma", name="docs")
store.upsert(markdown_doc)
chunks = store.retrieve("hello world", top_k=3)
```

OpenAIStore(client: Any, store_id: str, *, attributes_spec: Optional[Mapping[str, raghilda._attribute_schema.AttributeSpec]] = None, attributes: Optional[Mapping[str, type[str] | type[int] | type[float] | type[bool] | raghilda._attribute_schema.AttributeFloatVectorType | raghilda._attribute_schema.AttributeStructType]] = None)

A vector store backed by OpenAI's Vector Store API.

OpenAIStore uses OpenAI's hosted vector storage service for document storage and retrieval. Documents are uploaded as files and automatically chunked and embedded by OpenAI.

Examples
--------

```{python}
#| eval: false
from raghilda.store import OpenAIStore

# Create a new store
store = OpenAIStore.create(name="my-store")

# Or connect to an existing store
store = OpenAIStore.connect(store_id="vs_abc123")

# Insert documents
from raghilda.document import MarkdownDocument

doc = MarkdownDocument(content="# Hello\nWorld", origin="example.md")
store.upsert(doc)

# Retrieve similar chunks
chunks = store.retrieve("greeting", top_k=5)
```

PostgreSQLStore(con: psycopg2.extensions.connection, metadata: dict, schema: str)

A store backed by a PostgreSQL database with pgvector.

Uses PostgreSQL for storage with two retrieval methods:

- **Full-text search** via :meth:`retrieve_fts`: uses PostgreSQL's built-in ``tsvector``/``tsquery`` with ``ts_rank`` for ranking. A pre-computed ``tsvector`` column with a GIN index is created automatically.
- **Vector similarity search** via :meth:`retrieve_vss`: uses pgvector for nearest-neighbor search over embeddings.
An HNSW index for cosine distance is created automatically when an embedding provider is given. Use :meth:`build_index` to add indexes for other distance methods (L2, inner product).

## Embedding

Embedding providers for generating vector representations

EmbeddingProvider()

Interface for embedding function providers.

To create a custom embedding provider:

1. Subclass `EmbeddingProvider` and implement `embed()`, `get_config()`, and `from_config()`
2. Register it with `@register_embedding_provider("MyProvider")`

Registered providers are automatically restored when connecting to a database that was created with that provider.

Examples
--------

```{python}
#| eval: false
from raghilda.embedding import EmbeddingProvider, register_embedding_provider

@register_embedding_provider("MyCustomEmbedding")
class MyCustomEmbedding(EmbeddingProvider):
    def __init__(self, model: str = "default", api_key: str | None = None):
        self.model = model
        self.api_key = api_key
        # Initialize your embedding client here

    def embed(self, x, input_type=None):
        # Return list of embedding vectors
        ...

    def get_config(self):
        # Return config dict (exclude sensitive values like api_key)
        return {"type": "MyCustomEmbedding", "model": self.model}

    @classmethod
    def from_config(cls, config):
        return cls(model=config.get("model", "default"))
```

EmbedInputType(*values)

Specifies the type of input being embedded.

Some embedding models (e.g., Cohere) produce different embeddings for queries vs documents to optimize retrieval performance.

EmbeddingOpenAI(model: str = 'text-embedding-3-small', base_url: str = 'https://api.openai.com/v1', api_key: Optional[str] = None, batch_size: int = 20) -> None

Creates an embedding function provider backed by OpenAI's embedding models.

Implements the [EmbeddingProvider](`raghilda.EmbeddingProvider`) interface.

Parameters
----------
model
    The OpenAI embedding model to use. Default is "text-embedding-3-small".
base_url
    The base URL for the OpenAI API.
    Default is "https://api.openai.com/v1".
api_key
    The API key for authenticating with OpenAI. If None, it will use the OPENAI_API_KEY environment variable if set.
batch_size
    The number of texts to process in each batch when calling the API.

Examples
--------

```{python}
#| eval: false
from raghilda.embedding import EmbeddingOpenAI

provider = EmbeddingOpenAI(model="text-embedding-3-small")
embeddings = provider.embed(["hello world", "testing embeddings"])

print(len(embeddings))
print(len(embeddings[0]))   # Dimension of the embedding
print(embeddings[0][:10])   # First 10 values of the embedding vector
```

EmbeddingCohere(model: str = 'embed-english-v3.0', api_key: Optional[str] = None, batch_size: int = 96) -> None

Creates an embedding function provider backed by Cohere's embedding models.

Implements the [EmbeddingProvider](`raghilda.EmbeddingProvider`) interface.

Cohere's embedding models produce different embeddings for queries vs documents to optimize retrieval performance. Use `input_type=EmbedInputType.QUERY` when embedding search queries and `input_type=EmbedInputType.DOCUMENT` (default) when embedding documents for indexing.

Parameters
----------
model
    The Cohere embedding model to use. Default is "embed-english-v3.0".
api_key
    The API key for authenticating with Cohere. If None, it will use the CO_API_KEY environment variable if set.
batch_size
    The number of texts to process in each batch when calling the API. Cohere supports up to 96 texts per request.
Examples
--------

```{python}
#| eval: false
from raghilda.embedding import EmbeddingCohere, EmbedInputType

provider = EmbeddingCohere(model="embed-english-v3.0")

# Embed documents for indexing
doc_embeddings = provider.embed(
    ["Hello world", "Testing embeddings"],
    input_type=EmbedInputType.DOCUMENT,
)

# Embed a query for search
query_embedding = provider.embed(
    ["How do I test embeddings?"],
    input_type=EmbedInputType.QUERY,
)
```

EmbeddingSentenceTransformers(model: str = 'all-MiniLM-L6-v2', device: Optional[str] = None, batch_size: int = 64, prompts: Optional[dict[raghilda._embedding.EmbedInputType, str]] = None) -> None

Creates an embedding function provider backed by sentence-transformers models.

Implements the [EmbeddingProvider](`raghilda.EmbeddingProvider`) interface.

This provider runs models locally using the `sentence-transformers` library, enabling offline/private embedding without external API calls.

Parameters
----------
model
    The sentence-transformers model to use. Default is "all-MiniLM-L6-v2". Any model from the Hugging Face Hub that is compatible with sentence-transformers can be used.
device
    The device to run the model on (e.g., "cpu", "cuda", "mps"). If None, sentence-transformers will auto-detect the best available device.
batch_size
    The number of texts to process in each batch.
prompts
    Optional mapping from `EmbedInputType` to a prefix string to prepend to each text before encoding. This is useful for models that require task-specific prefixes (e.g., nomic-embed-text uses "search_query: " and "search_document: ").
Examples
--------

Install raghilda with sentence-transformers support:

```bash
pip install raghilda[sentence-transformers]
```

```{python}
#| eval: false
from raghilda.embedding import EmbeddingSentenceTransformers

provider = EmbeddingSentenceTransformers(model="all-MiniLM-L6-v2")
embeddings = provider.embed(["hello world", "testing embeddings"])

print(len(embeddings))
print(len(embeddings[0]))  # Dimension of the embedding
```

For models that use task-specific prefixes:

```{python}
#| eval: false
from raghilda.embedding import EmbeddingSentenceTransformers, EmbedInputType

provider = EmbeddingSentenceTransformers(
    model="nomic-ai/nomic-embed-text-v1.5",
    prompts={
        EmbedInputType.QUERY: "search_query: ",
        EmbedInputType.DOCUMENT: "search_document: ",
    },
)

# Queries get "search_query: " prepended automatically
query_emb = provider.embed(["Who is Laurens van der Maaten?"], EmbedInputType.QUERY)

# Documents get "search_document: " prepended automatically
doc_emb = provider.embed(["TSNE is a dimensionality reduction algorithm"])
```

## Chunker

Text chunking utilities for splitting documents

BaseChunker()

Base class for chunkers.

A chunker splits a :py:class:`raghilda.document.Document` into a :py:class:`raghilda.document.ChunkedDocument` containing smaller text segments suitable for embedding and retrieval.

Subclasses must implement :py:meth:`chunk` and :py:meth:`chunk_text` to provide a concrete chunking strategy:

- :py:class:`raghilda.chunker.MarkdownChunker`: splits Markdown documents at semantic boundaries (headings, paragraphs, sentences).

MarkdownChunker(chunk_size: int = 1600, target_overlap: float = 0.5, *, max_snap_distance: int = 20, segment_by_heading_levels: Optional[list[int]] = None) -> None

Chunk Markdown documents into overlapping segments at semantic boundaries.

This chunker divides Markdown text into smaller, overlapping chunks while intelligently positioning cut points at semantic boundaries like headings, paragraphs, sentences, and words.
Rather than cutting rigidly at character counts, it nudges cut points to the nearest sensible boundary, producing more semantically coherent chunks suitable for RAG applications.

Parameters
----------
chunk_size
    Target size for each chunk in characters. The chunker attempts to create chunks near this size, though actual sizes may vary based on semantic boundaries. Default is 1600 characters.
target_overlap
    Fraction of overlap between successive chunks, from 0 to 1. Default is 0.5 (50% overlap). Even with 0, some overlap may occur because the last chunk is anchored to the document end.
max_snap_distance
    Maximum distance (in characters) to move a cut point to reach a semantic boundary. If no boundary is found within this distance, the cut point stays at its original position. Default is 20.
segment_by_heading_levels
    List of heading levels (1-6) that act as hard boundaries. When specified, no chunk will cross these headings, and segments between them are chunked independently. For example, `[1, 2]` ensures chunks never span across h1 or h2 headings.

Examples
--------

```{python}
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker(
    chunk_size=100,
    target_overlap=0.2,
    segment_by_heading_levels=[1, 2],
)

text = '''# Introduction

This is the introduction section with some content.

## Background

Here is background information that provides context.

## Methods

The methods section describes our approach.
'''

chunks = chunker.chunk_text(text)
for chunk in chunks:
    print(f"[{chunk.start_index}:{chunk.end_index}] {chunk.text[:40]}...")
```

Notes
-----

The chunking algorithm works as follows:

1. Parse the Markdown to identify semantic boundaries (headings, paragraphs, sentences, lines, words)
2. If `segment_by_heading_levels` is set, split the document at those headings first
3. For each segment, calculate target chunk boundaries based on `chunk_size` and `target_overlap`
4. Snap each boundary to the nearest semantic boundary (preferring headings > paragraphs > sentences > lines > words)
5. Extract chunks with their positional information and heading context

## Utilities

Utility functions for reading and scraping content

read_as_markdown(uri: str, html_extract_selectors: Optional[list[str]] = None, html_zap_selectors: Optional[list[str]] = None, *args, **kwargs) -> raghilda.document.MarkdownDocument

Read a markdown file from a URI and return its content as a MarkdownDocument.

Parameters
----------
uri
    The URI of the markdown file to read. Supported schemes are:

    - path/to/file.md
    - http://example.com/file.md
    - https://example.com/file.md
html_extract_selectors
    A list of CSS selectors to extract specific parts of the HTML content when the URI points to an HTML page. Defaults to ['main'].
html_zap_selectors
    A list of CSS selectors to remove specific parts of the HTML content when the URI points to an HTML page. Defaults to ['nav'].

Returns
-------
MarkdownDocument
    The content of the markdown file as a MarkdownDocument object.

Examples
--------

```{python}
#| eval: false
from raghilda.read import read_as_markdown

# Read from a local file
md_content = read_as_markdown("path/to/file.md")
print(md_content)

# Read from an HTTP URL
md_content = read_as_markdown("https://raw.githubusercontent.com/user/repo/branch/file.md")
print(md_content)
```

find_links(x: 'str | Path | Sequence[str | Path]', depth: 'int' = 0, children_only: 'bool' = False, progress: 'bool' = True, *, url_filter: 'Callable[[set[str]], list[str]] | None' = None, validate: 'bool' = False, **request_kwargs: 'Any') -> 'list[str]'

Discover hyperlinks starting from one or many documents and return them as URLs.

Parameters
----------
x
    Starting URL(s). Accepts strings or paths; inputs must expand to HTTP(S) URLs.
depth
    Maximum traversal depth from each starting document. ``0`` inspects the starting pages only, ``1`` also inspects their direct children, and so on.
children_only
    When ``True``, only links that stay under the originating host are returned and traversed.
progress
    Whether to display a progress bar while traversing links. Falls back to a no-op when :mod:`tqdm` is not available.
url_filter
    Receives the set of discovered URLs and returns the list of URLs to keep, which may be smaller.
validate
    When ``True``, perform a lightweight validation to ensure targets are reachable before including them in the results.
request_kwargs
    Additional keyword arguments forwarded to :func:`requests.Session.get` (and ``head`` during validation) when fetching HTTP resources.

Returns
-------
list[str]
    Absolute link targets, deduplicated and ordered as discovered.

## Chunk

Chunk data types

Chunk(text: str, start_index: int, end_index: int, char_count: int, context: Optional[str] = None, origin: Optional[str] = None, attributes: Optional[dict[str, Any]] = None) -> None

A segment of text extracted from a document.

Chunks are the fundamental unit for retrieval in RAG applications. Each chunk contains the text content along with positional information that allows mapping back to the original document.

Attributes
----------
text
    The actual text content of the chunk.
start_index
    Character position where this chunk begins in the source document.
end_index
    Character position where this chunk ends in the source document.
char_count
    Number of characters in this chunk.
context
    Optional heading context showing the document hierarchy at this chunk's position (e.g., the Markdown headings that apply).
origin
    Origin of the parent document this chunk belongs to.
attributes
    Optional user-defined attributes associated with the chunk. These attributes can be used for retrieval filtering/scoping and downstream prompt/context construction.
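The positional fields above are enough to map a chunk back to its source document. A minimal sketch using a stand-in dataclass with the same fields (illustration only, not the real `raghilda.chunk.Chunk`, which may carry extra behavior):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in mirroring the documented Chunk fields (illustration only)
@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    char_count: int
    context: Optional[str] = None
    origin: Optional[str] = None

document = "# Intro\n\nChunks map back to their source by character position."

chunk = Chunk(
    text=document[9:],              # everything after "# Intro\n\n"
    start_index=9,
    end_index=len(document),
    char_count=len(document) - 9,
    context="# Intro",
    origin="example.md",
)

# The positional fields recover the chunk text from the original document
assert document[chunk.start_index:chunk.end_index] == chunk.text
assert chunk.char_count == chunk.end_index - chunk.start_index
```

This slicing invariant is what lets retrieval results be traced back to (and highlighted in) the original document.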
MarkdownChunk(text: str, start_index: int, end_index: int, char_count: int, context: Optional[str] = None, origin: Optional[str] = None, attributes: Optional[dict[str, Any]] = None) -> None

A chunk extracted from a Markdown document.

MarkdownChunk extends Chunk for use with Markdown content. It typically preserves heading context from the source document, allowing retrieval results to show where in the document hierarchy each chunk originated.

RetrievedChunk(text: str, start_index: int, end_index: int, char_count: int, context: Optional[str] = None, origin: Optional[str] = None, attributes: Optional[dict[str, Any]] = None, metrics: list[raghilda.chunk.Metric] = <factory>, chunk_ids: list[int] = <factory>) -> None

A chunk returned from a retrieval operation with associated metrics.

RetrievedChunk extends Chunk with retrieval metrics that indicate how well the chunk matched the query. Common metrics include similarity scores and BM25 scores.

Attributes
----------
metrics
    List of Metric objects containing retrieval scores.
chunk_ids
    Backend chunk identifiers represented by this retrieved chunk. For non-deoverlapped results this usually contains one id. For deoverlapped chunks it may include multiple source chunk ids.

Examples
--------

```{python}
from raghilda.chunk import RetrievedChunk, Metric

chunk = RetrievedChunk(
    text="This is relevant content.",
    start_index=0,
    end_index=25,
    char_count=25,
    metrics=[
        Metric(name="similarity", value=0.92),
        Metric(name="bm25_score", value=15.3),
    ],
)

for metric in chunk.metrics:
    print(f"{metric.name}: {metric.value}")
```

Metric(name: str, value: float) -> None

A named metric value associated with a retrieved chunk.

Metrics are used to store retrieval scores and other measurements that describe how well a chunk matches a query.

Attributes
----------
name
    The name of the metric (e.g., "similarity", "bm25_score").
value
    The numeric value of the metric.
Examples
--------

```{python}
from raghilda.chunk import Metric

similarity = Metric(name="similarity", value=0.95)
print(f"{similarity.name}: {similarity.value}")
```

## Document

Document types for unchunked and chunked content

Document(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None) -> None

A document containing text content to be chunked and indexed.

Documents are the primary input for RAG stores. Each document has text content and an optional origin identifier.

Attributes
----------
content
    The full text content of the document.
origin
    Unique origin identifier for the document. This can be None or an empty string while preparing a document object, but stores require a populated origin for upsert operations.
attributes
    Optional user-defined attributes applied at document insertion time. Document-level attributes can be inherited by chunks and returned during retrieval for filtering and downstream prompt/context use.

ChunkedDocument(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None, *, chunks: 'list[Chunk]') -> None

A document with an attached sequence of chunks.

This is the explicit chunked variant of `Document`, used by stores and chunkers that operate on pre-segmented content.

MarkdownDocument(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None) -> None

A Markdown document with source tracking.

MarkdownDocument extends Document with markdown-specific semantics for content that comes from a source origin (e.g., URL or file path). This is useful for citation and provenance tracking in RAG applications.
Examples
--------

```{python}
from raghilda.document import MarkdownDocument

# Create from content directly
doc = MarkdownDocument(
    content="# Hello World\n\nThis is a test document.",
    origin="https://example.com/hello.md",
)

print(f"Document from: {doc.origin}")
print(f"Content length: {len(doc.content)} characters")
```

ChunkedMarkdownDocument(content: 'str', origin: 'Optional[str]' = None, attributes: 'Optional[dict[str, Any]]' = None, *, chunks: 'list[Chunk]') -> None

A Markdown document with an attached sequence of chunks.

## Types

Protocol types for type checking compatibility

ChunkLike(*args, **kwargs)

Any chunk-like object (chonkie, raghilda, or custom).

ChunkedDocumentLike(*args, **kwargs)

Any chunked document-like object.

DocumentLike(*args, **kwargs)

Any document-like object.

ChunkerLike(*args, **kwargs)

Any chunker-like object (chonkie, raghilda, or custom).

IntoChunk(*args, **kwargs)

Any object that can be converted into a Chunk via to_chunk().

IntoDocument(*args, **kwargs)

Any object that can be converted into a Document via to_document().

----------------------------------------------------------------------
This is the User Guide documentation for the package.
----------------------------------------------------------------------

## Getting Started

### Core Concepts

Large language models (LLMs) sometimes generate confident but incorrect information, a phenomenon known as hallucination. This happens because LLMs work by predicting the most likely next words based on patterns learned during training, without any inherent concept of truth or factual accuracy.

## Why RAG?

Retrieval-Augmented Generation (RAG) addresses this by grounding LLM responses in trusted source material. Instead of relying solely on the model's training data, RAG retrieves relevant content from a curated knowledge base and includes it in the prompt. This shifts the model's role from open-ended generation to summarizing vetted content.
While RAG doesn't eliminate hallucinations entirely, it significantly reduces them for domain-specific applications by ensuring responses are anchored in authoritative sources.

## Building a RAG System

A RAG system has two main phases:

1. **Preparation**: Building a searchable knowledge store from your documents
2. **Retrieval**: Finding relevant content to augment LLM prompts

Let's walk through building a RAG system using the [Quarto documentation](https://quarto.org/docs/guide/) as our knowledge base.

## Creating a Store

First, create a store with an embedding provider. The store will hold your document chunks and their vector embeddings:

```{python}
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(
    location="quarto_docs.db",
    embed=EmbeddingOpenAI(),
    name="quarto",
    title="Quarto Documentation",
    overwrite=True,
)
```

raghilda supports multiple embedding providers (OpenAI, Cohere, sentence-transformers) and storage backends (DuckDB, ChromaDB, OpenAI Vector Stores, PostgreSQL). See the [API Reference](/reference/index.qmd) for all options.

## Finding Documents

Next, identify the documents to include. The `find_links()` function can crawl a website to discover pages:

```{python}
from raghilda.scrape import find_links

links = find_links(
    "https://quarto.org/docs/guide/",
    depth=1,  # follow links 1 level deep from the starting page
    children_only=True,
)
print(f"Found {len(links)} pages")
```

The `depth` parameter controls how many levels of links to follow, and `children_only=True` restricts crawling to pages under the starting URL.
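For finer control over which discovered links are kept, `find_links()` also accepts a `url_filter` callable that receives the discovered URLs and returns the (possibly smaller) list to keep. The filter itself is plain Python, so it can be written and tested without raghilda; a sketch (the kept suffixes here are an assumption for illustration, not a library default):

```python
def docs_pages_only(urls):
    """Keep only links that look like renderable documentation pages.

    Matches the url_filter contract: receives discovered URLs,
    returns the list of URLs to keep.
    """
    keep_suffixes = (".html", ".md", "/")  # assumed suffixes for this example
    return sorted(u for u in urls if u.endswith(keep_suffixes))

discovered = {
    "https://quarto.org/docs/guide/",
    "https://quarto.org/docs/guide/basics.html",
    "https://quarto.org/docs/download/archive.zip",
}
kept = docs_pages_only(discovered)
print(kept)  # the .zip archive is dropped
```

Pass it as `find_links(..., url_filter=docs_pages_only)` to prune pages before they are fetched and traversed.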
You can also work with local files or provide a list of URLs directly:

```{python}
#| eval: false
# Local files
links = ["docs/guide.md", "docs/reference.md", "docs/tutorial.md"]

# Or use glob patterns with pathlib
from pathlib import Path
links = list(Path("docs").glob("**/*.md"))
```

## Preparing Documents

Prepare each document explicitly by reading it, chunking it, and passing the result to `upsert()`:

```{python}
from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker()

for link in links:
    document = read_as_markdown(link)
    chunked = chunker.chunk(document)
    store.upsert(chunked)

print(f"Indexed {store.size()} documents")
```

That is the full preparation phase. Each document is converted to Markdown, split into overlapping chunks, embedded, and written to the store through explicit calls that keep the indexing pipeline visible.

## What Happens During Preparation

Each item you index typically goes through two steps before it is stored:

**1. Convert to Markdown** — `read_as_markdown()` converts the item (a URL or file path) into a Markdown document. It handles HTML pages, PDFs, DOCX files, and more using [MarkItDown](https://github.com/microsoft/markitdown). For HTML, it extracts the `main` element and removes `nav` elements by default.