store.DuckDBStore

A vector store backed by DuckDB.

Usage

store.DuckDBStore()

DuckDBStore provides local vector storage with support for both semantic search (using embeddings) and full-text search (using BM25). Data is persisted to a DuckDB database file.

Examples

from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

# Create a new store with embeddings
store = DuckDBStore.create(
    location="my_store.db",
    embed=EmbeddingOpenAI(),
)

# Insert a chunked document
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nThis is a sample document.",
)
store.upsert(MarkdownChunker().chunk(doc))

# Retrieve similar chunks
chunks = store.retrieve("How do I use this?", top_k=5)

Methods

Name              Description
build_index()     Build the specified index types on the embeddings table.
connect()         Connect to an existing DuckDB store.
create()          Create a new DuckDB store.
retrieve()        Retrieve the most similar chunks to the given text.
retrieve_bm25()   Retrieve chunks using BM25 full-text search.
retrieve_vss()    Retrieve chunks using vector similarity search.
upsert()          Upsert a document into the store.

build_index()

Build the specified index types on the embeddings table.

Usage

build_index(type=None)
Parameters
type: Optional[IndexType | str | list[IndexType | str]] = None
The type of index to build. Can be a single IndexType/string ("bm25" or "hnsw") or a list of those values. If None, builds both BM25 and HNSW indexes.
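The accepted argument forms can be sketched as follows. This assumes `IndexType` is importable from `raghilda.store` with `BM25` and `HNSW` members; the exact import path is an assumption, not confirmed by this reference.

```python
from raghilda.store import DuckDBStore, IndexType  # IndexType location is an assumption
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(location="my_store.db", embed=EmbeddingOpenAI())

# Build only the BM25 full-text index, by string name
store.build_index(type="bm25")

# Build only the HNSW vector index, via the enum
store.build_index(type=IndexType.HNSW)

# Build both indexes (equivalent to the default, type=None)
store.build_index()
```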

connect()

Connect to an existing DuckDB store.

Usage

connect(location=":memory:", read_only=False)
Parameters
location: str | Path = ":memory:"

Path to the DuckDB database file.

read_only: bool = False
Whether to open the database in read-only mode.
Returns
DuckDBStore
A connected store instance.
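A minimal sketch of opening an existing on-disk store without any risk of modifying it:

```python
from raghilda.store import DuckDBStore

# Open an existing database file in read-only mode
store = DuckDBStore.connect("my_store.db", read_only=True)

# Retrieval works as usual against the connected store
chunks = store.retrieve("example query", top_k=3)
```

Read-only mode is a reasonable default for query-serving processes, since DuckDB allows multiple concurrent read-only connections to the same file.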

create()

Create a new DuckDB store.

Usage

create(
    location, embed, overwrite=False, name=None, title=None, attributes=None
)
Parameters
location: str | Path

Path where the DuckDB database file will be created.

embed: Optional[EmbeddingProvider]

Embedding provider for generating vector embeddings. If None, only full-text search will be available.

overwrite: bool = False

Whether to overwrite an existing database at the location.

name: Optional[str] = None

Internal name for the store.

title: Optional[str] = None

Human-readable title for the store.

attributes: Optional[AttributesSchemaSpec] = None
Optional schema for user-defined attribute columns stored per chunk. Example: {"tenant": str, "priority": int}. Attribute names use identifier-style syntax. The following names are reserved and cannot be used as attributes: chunk_id, context, embedding, origin, text, start_index, end_index, char_count, metric_name, and metric_value.
Returns
DuckDBStore
A newly created store instance.
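A sketch tying the parameters above together, including an attribute schema of the form shown in the attributes description:

```python
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(
    location="tickets.db",
    embed=EmbeddingOpenAI(),
    overwrite=True,            # replace any existing file at this path
    name="tickets",
    title="Support Tickets",
    attributes={"tenant": str, "priority": int},  # user-defined per-chunk columns
)

# With embed=None, only BM25 full-text retrieval is available
fts_only = DuckDBStore.create(location="fts_only.db", embed=None, overwrite=True)
```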

retrieve()

Retrieve the most similar chunks to the given text.

Usage

retrieve(text, top_k=3, *, deoverlap=True, attributes_filter=None)

Combines results from vector similarity search (if embeddings are available) and BM25 full-text search, then optionally merges overlapping chunks.

Parameters
text: str

The query text to search for.

top_k: int = 3

The maximum number of chunks to return from each retrieval method (VSS and BM25). Because results from both methods are combined before deoverlapping, the final count may differ from top_k.

deoverlap: bool = True

If True (default), merge overlapping chunks from the same document. Overlapping chunks are identified by their start_index and end_index positions. When merged, the resulting chunk spans the union of the original ranges, combines metrics, and aggregates attribute values into per-chunk lists in start-order. The context value is kept from the first chunk in each merged overlap group.

attributes_filter: Optional[AttributeFilter] = None
Optional filter to scope retrieval using attribute columns. Can be a SQL-like string or a dict AST. Example string: "tenant = 'docs' AND priority >= 2". Supports declared attributes plus built-in columns: chunk_id, origin, start_index, end_index, char_count, and context.
Returns
Sequence[RetrievedDuckDBMarkdownChunk]
The retrieved chunks with their relevance metrics.
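A hedged sketch of a filtered retrieval, assuming the store was created with `tenant` and `priority` attribute columns, and assuming retrieved chunks expose the built-in columns (`origin`, `text`) as attributes:

```python
# Combined VSS + BM25 retrieval, scoped by an attribute filter string
chunks = store.retrieve(
    "how to configure retries",
    top_k=5,
    deoverlap=True,
    attributes_filter="tenant = 'docs' AND priority >= 2",
)

for chunk in chunks:
    print(chunk.origin, chunk.text[:60])  # field names assumed from the built-in columns
```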

retrieve_bm25()

Retrieve chunks using BM25 full-text search.

Usage

retrieve_bm25(
    query, top_k, *, k=1.2, b=0.75, conjunctive=False, attributes_filter=None
)

Uses DuckDB’s fts (Full-Text Search) extension for BM25 ranking. See https://duckdb.org/docs/extensions/full_text_search.html for more details.

Parameters
query: str

The search query text.

top_k: int

The maximum number of chunks to return.

k: float = 1.2

BM25 term frequency saturation parameter. Higher values increase the impact of term frequency. Default is 1.2.

b: float = 0.75

BM25 length normalization parameter (0-1). Higher values penalize longer documents more. Default is 0.75.

conjunctive: bool = False

If True, all query terms must be present (AND). If False (default), any query term can match (OR).

attributes_filter: Optional[AttributeFilter] = None
Optional attribute filter as SQL-like string or dict AST. Supports declared attributes plus built-in columns: chunk_id, origin, start_index, end_index, char_count, and context.
Returns
list[RetrievedDuckDBMarkdownChunk]
The matching chunks ranked by BM25 score.
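The BM25 tuning parameters can be sketched like this (values chosen for illustration only):

```python
# Rank purely by BM25; require every query term to be present
chunks = store.retrieve_bm25(
    "duckdb vector index",
    top_k=10,
    k=1.5,              # stronger term-frequency influence than the 1.2 default
    b=0.5,              # milder length normalization than the 0.75 default
    conjunctive=True,   # AND semantics: all three terms must match
)
```

Raising k lets repeated terms keep contributing to the score; lowering b reduces the penalty on long chunks, which can help when chunk lengths vary widely.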

retrieve_vss()

Retrieve chunks using vector similarity search.

Usage

retrieve_vss(
    query, top_k, *, method=VSSMethod.COSINE_DISTANCE, attributes_filter=None
)

Uses DuckDB’s vss extension for vector similarity search. See https://duckdb.org/docs/extensions/vss.html for more details.

Parameters
query: str | Sequence[float]

The query text or embedding vector. If a string is provided, it will be embedded using the store’s embedding provider.

top_k: int

The maximum number of chunks to return.

method: VSSMethod = VSSMethod.COSINE_DISTANCE
The similarity method to use. Options are:
  • COSINE_DISTANCE: Cosine distance (default)
  • EUCLIDEAN_DISTANCE: L2/Euclidean distance
  • NEGATIVE_INNER_PRODUCT: Negative dot product
attributes_filter: Optional[AttributeFilter] = None
Optional attribute filter as SQL-like string or dict AST. Supports declared attributes plus built-in columns: chunk_id, origin, start_index, end_index, char_count, and context.
Returns
list[RetrievedDuckDBMarkdownChunk]
The most similar chunks with similarity metrics.
Raises
ValueError
If query is a string but no embedding provider is configured.
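Both query forms can be sketched as follows. The `VSSMethod` import path and the shape of a precomputed embedding are assumptions; only the signature above is confirmed by this reference.

```python
from raghilda.store import VSSMethod  # import path is an assumption

# Query by text: the store's embedding provider embeds it first.
# Raises ValueError if the store was created with embed=None.
chunks = store.retrieve_vss("vector search tips", top_k=5)

# Query by a precomputed vector, with an explicit distance metric.
# precomputed must match the store's embedding dimensionality (assumption).
precomputed = [0.01] * 1536
chunks = store.retrieve_vss(
    precomputed,
    top_k=5,
    method=VSSMethod.EUCLIDEAN_DISTANCE,
)
```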

upsert()

Upsert a document into the store.

Usage

upsert(document, *, skip_if_unchanged=True)

The document must be a raghilda.document.ChunkedMarkdownDocument. Use raghilda.chunker.MarkdownChunker to chunk a raghilda.document.MarkdownDocument before upserting.

Parameters
document: Document

The chunked document to upsert.

skip_if_unchanged: bool = True
If True (default), skip the write when the existing document for the same origin already has identical content and chunk layout. This avoids re-computing embeddings.
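A sketch of the full upsert flow, including forcing a rewrite of an unchanged document:

```python
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nAn updated revision of the document.",
)
chunked = MarkdownChunker().chunk(doc)

# Default: the write (and the embedding work) is skipped if the stored
# document for this origin already has identical content and chunk layout
store.upsert(chunked)

# Force a rewrite even when nothing has changed
store.upsert(chunked, skip_if_unchanged=False)
```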