store.DuckDBStore

A vector store backed by DuckDB.

Usage

store.DuckDBStore()

DuckDBStore provides local vector storage with support for both semantic search (using embeddings) and full-text search (using BM25). Data is persisted to a DuckDB database file.

Examples

from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

# Create a new store with embeddings
store = DuckDBStore.create(
    location="my_store.db",
    embed=EmbeddingOpenAI(),
)

# Insert a chunked document
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nThis is a sample document.",
)
store.upsert(MarkdownChunker().chunk(doc))

# Retrieve similar chunks
chunks = store.retrieve("How do I use this?", top_k=5)

Methods

Name              Description
build_index()     Build the specified index types on the embeddings table.
connect()         Connect to an existing DuckDB store.
create()          Create a new DuckDB store.
retrieve()        Retrieve the most similar chunks to the given text.
retrieve_bm25()   Retrieve chunks using BM25 full-text search.
retrieve_vss()    Retrieve chunks using vector similarity search.
upsert()          Upsert a document into the store.

build_index()

Build the specified index types on the embeddings table.

Usage

build_index(type=None)
Parameters
type: Optional[IndexType | str | list[IndexType | str]] = None
The type of index to build. Can be a single IndexType/string ("bm25" or "hnsw") or a list of those values. If None, builds both BM25 and HNSW indexes.
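The accepted argument forms can be sketched as follows. This assumes `IndexType` is importable from `raghilda.store` with `BM25` and `HNSW` members; the exact import path is an assumption, not confirmed by this reference.

```python
from raghilda.store import DuckDBStore, IndexType  # IndexType location is an assumption
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(location="my_store.db", embed=EmbeddingOpenAI())

# Build only the BM25 full-text index, by string name
store.build_index(type="bm25")

# Build only the HNSW vector index, via the enum
store.build_index(type=IndexType.HNSW)

# Build both indexes (equivalent to the default, type=None)
store.build_index()
```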

connect()

Connect to an existing DuckDB store.

Usage

connect(location=":memory:", read_only=False)
Parameters
location: str | Path = ":memory:"

Path to the DuckDB database file.

read_only: bool = False
Whether to open the database in read-only mode.
Returns
DuckDBStore
A connected store instance.
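A minimal sketch of opening an existing on-disk store without any risk of modifying it:

```python
from raghilda.store import DuckDBStore

# Open an existing database file in read-only mode
store = DuckDBStore.connect("my_store.db", read_only=True)

# Retrieval works as usual against the connected store
chunks = store.retrieve("example query", top_k=3)
```

Read-only mode is a reasonable default for query-serving processes, since DuckDB allows multiple concurrent read-only connections to the same file.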

create()

Create a new DuckDB store.

Usage

create(
    location, embed, overwrite=False, name=None, title=None, attributes=None
)
Parameters
location: str | Path

Path where the DuckDB database file will be created.

embed: Optional[EmbeddingProvider]

Embedding provider for generating vector embeddings. If None, only full-text search will be available.

overwrite: bool = False

Whether to overwrite an existing database at the location.

name: Optional[str] = None

Internal name for the store.

title: Optional[str] = None

Human-readable title for the store.

attributes: Optional[AttributesSchemaSpec] = None
Optional schema for user-defined attribute columns stored per chunk. Example: {"tenant": str, "priority": int}. Attribute names use identifier-style syntax. The following names are reserved and cannot be used as attributes: chunk_id, context, embedding, origin, text, start_index, end_index, char_count, metric_name, and metric_value.
Returns
DuckDBStore
A newly created store instance.
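A sketch tying the parameters above together, including an attribute schema of the form shown in the attributes description:

```python
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(
    location="tickets.db",
    embed=EmbeddingOpenAI(),
    overwrite=True,            # replace any existing file at this path
    name="tickets",
    title="Support Tickets",
    attributes={"tenant": str, "priority": int},  # user-defined per-chunk columns
)

# With embed=None, only BM25 full-text retrieval is available
fts_only = DuckDBStore.create(location="fts_only.db", embed=None, overwrite=True)
```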

retrieve()

Retrieve the most similar chunks to the given text.

Usage

retrieve(text, top_k=3, *, deoverlap=True, attributes_filter=None)

Combines results from vector similarity search (if embeddings are available) and BM25 full-text search, then optionally merges overlapping chunks.

Parameters
text: str

The query text to search for.

top_k: int = 3

The maximum number of chunks to return from each retrieval method (VSS and BM25). Because results from both methods are combined before deoverlapping, the final count may differ from top_k.

deoverlap: bool = True

If True (default), merge overlapping chunks from the same document. Overlapping chunks are identified by their start_index and end_index positions. When merged, the resulting chunk spans the union of the original ranges, combines metrics, and aggregates attribute values into per-chunk lists in start-order. The context value is kept from the first chunk in each merged overlap group.

attributes_filter: Optional[AttributeFilter] = None
Optional filter to scope retrieval using attribute columns. Can be a SQL-like string or a dict AST. Example string: "tenant = 'docs' AND priority >= 2". Supports declared attributes plus built-in columns: chunk_id, origin, start_index, end_index, char_count, and context.
Returns
Sequence[RetrievedDuckDBMarkdownChunk]
The retrieved chunks with their relevance metrics.
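A hedged sketch of a filtered retrieval, assuming the store was created with `tenant` and `priority` attribute columns, and assuming retrieved chunks expose the built-in columns (`origin`, `text`) as attributes:

```python
# Combined VSS + BM25 retrieval, scoped by an attribute filter string
chunks = store.retrieve(
    "how to configure retries",
    top_k=5,
    deoverlap=True,
    attributes_filter="tenant = 'docs' AND priority >= 2",
)

for chunk in chunks:
    print(chunk.origin, chunk.text[:60])  # field names assumed from the built-in columns
```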

retrieve_bm25()

Retrieve chunks using BM25 full-text search.

Usage

retrieve_bm25(
    query, top_k, *, k=1.2, b=0.75, conjunctive=False, attributes_filter=None
)

Uses DuckDB’s fts (Full-Text Search) extension for BM25 ranking. See https://duckdb.org/docs/extensions/full_text_search.html for more details.

Parameters
query: str

The search query text.

top_k: int

The maximum number of chunks to return.

k: float = 1.2

BM25 term frequency saturation parameter. Higher values increase the impact of term frequency. Default is 1.2.

b: float = 0.75

BM25 length normalization parameter (0-1). Higher values penalize longer documents more. Default is 0.75.

conjunctive: bool = False

If True, all query terms must be present (AND). If False (default), any query term can match (OR).

attributes_filter: Optional[AttributeFilter] = None
Optional attribute filter as SQL-like string or dict AST. Supports declared attributes plus built-in columns: chunk_id, origin, start_index, end_index, char_count, and context.
Returns
list[RetrievedDuckDBMarkdownChunk]
The matching chunks ranked by BM25 score.
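The BM25 tuning parameters can be sketched like this (values chosen for illustration only):

```python
# Rank purely by BM25; require every query term to be present
chunks = store.retrieve_bm25(
    "duckdb vector index",
    top_k=10,
    k=1.5,              # stronger term-frequency influence than the 1.2 default
    b=0.5,              # milder length normalization than the 0.75 default
    conjunctive=True,   # AND semantics: all three terms must match
)
```

Raising k lets repeated terms keep contributing to the score; lowering b reduces the penalty on long chunks, which can help when chunk lengths vary widely.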

retrieve_vss()

Retrieve chunks using vector similarity search.

Usage

retrieve_vss(
    query, top_k, *, method=VSSMethod.COSINE_DISTANCE, attributes_filter=None
)

Uses DuckDB’s vss extension for vector similarity search. See https://duckdb.org/docs/extensions/vss.html for more details.

Parameters
query: str | Sequence[float]

The query text or embedding vector. If a string is provided, it will be embedded using the store’s embedding provider.

top_k: int

The maximum number of chunks to return.

method: VSSMethod = VSSMethod.COSINE_DISTANCE
The similarity method to use. Options are:
  • COSINE_DISTANCE: Cosine distance (default)
  • EUCLIDEAN_DISTANCE: L2/Euclidean distance
  • NEGATIVE_INNER_PRODUCT: Negative dot product
attributes_filter: Optional[AttributeFilter] = None
Optional attribute filter as SQL-like string or dict AST. Supports declared attributes plus built-in columns: chunk_id, origin, start_index, end_index, char_count, and context.
Returns
list[RetrievedDuckDBMarkdownChunk]
The most similar chunks with similarity metrics.
Raises
ValueError
If query is a string but no embedding provider is configured.
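Both query forms can be sketched as follows. The `VSSMethod` import path and the shape of a precomputed embedding are assumptions; only the signature above is confirmed by this reference.

```python
from raghilda.store import VSSMethod  # import path is an assumption

# Query by text: the store's embedding provider embeds it first.
# Raises ValueError if the store was created with embed=None.
chunks = store.retrieve_vss("vector search tips", top_k=5)

# Query by a precomputed vector, with an explicit distance metric.
# precomputed must match the store's embedding dimensionality (assumption).
precomputed = [0.01] * 1536
chunks = store.retrieve_vss(
    precomputed,
    top_k=5,
    method=VSSMethod.EUCLIDEAN_DISTANCE,
)
```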

upsert()

Upsert a document into the store.

Usage

upsert(document, *, skip_if_unchanged=True)

The document must be a raghilda.document.ChunkedMarkdownDocument. Use raghilda.chunker.MarkdownChunker to chunk a raghilda.document.MarkdownDocument before upserting.

Parameters
document: Document

The chunked document to upsert.

skip_if_unchanged: bool = True
If True (default), skip the write when the existing document for the same origin already has identical content and chunk layout. This avoids re-computing embeddings.
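A sketch of the full upsert flow, including forcing a rewrite of an unchanged document:

```python
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nAn updated revision of the document.",
)
chunked = MarkdownChunker().chunk(doc)

# Default: the write (and the embedding work) is skipped if the stored
# document for this origin already has identical content and chunk layout
store.upsert(chunked)

# Force a rewrite even when nothing has changed
store.upsert(chunked, skip_if_unchanged=False)
```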