```python
from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

# Create a new store with embeddings
store = DuckDBStore.create(
    location="my_store.db",
    embed=EmbeddingOpenAI(),
)

# Insert a chunked document
from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

doc = MarkdownDocument(
    origin="https://example.com/doc1.md",
    content="# Example\n\nThis is a sample document.",
)
store.upsert(MarkdownChunker().chunk(doc))

# Retrieve similar chunks
chunks = store.retrieve("How do I use this?", top_k=5)
```

# store.DuckDBStore
A vector store backed by DuckDB.

## Usage

```python
store.DuckDBStore()
```

DuckDBStore provides local vector storage with support for both semantic search (using embeddings) and full-text search (using BM25). Data is persisted to a DuckDB database file.
## Examples

See the quickstart example at the top of this page.

## Methods
| Name | Description |
|---|---|
| build_index() | Build the specified index types on the embeddings table. |
| connect() | Connect to an existing DuckDB store. |
| create() | Create a new DuckDB store. |
| retrieve() | Retrieve the most similar chunks to the given text. |
| retrieve_bm25() | Retrieve chunks using BM25 full-text search. |
| retrieve_vss() | Retrieve chunks using vector similarity search. |
| upsert() | Upsert a document into the store. |
## build_index()

Build the specified index types on the embeddings table.

### Usage

```python
build_index(type=None)
```

### Parameters

- `type`: `Optional[IndexType | str | list[IndexType | str]] = None` - The type of index to build. Can be a single `IndexType`/string (`"bm25"` or `"hnsw"`) or a list of those values. If `None`, builds both the BM25 and HNSW indexes.
## connect()

Connect to an existing DuckDB store.

### Usage

```python
connect(location=":memory:", read_only=False)
```

### Parameters

- `location`: `str | Path = ":memory:"` - Path to the DuckDB database file.
- `read_only`: `bool = False` - Whether to open the database in read-only mode.

### Returns

- `DuckDBStore` - A connected store instance.
## create()

Create a new DuckDB store.

### Usage

```python
create(
    location, embed, overwrite=False, name=None, title=None, attributes=None
)
```

### Parameters

- `location`: `str | Path` - Path where the DuckDB database file will be created.
- `embed`: `Optional[EmbeddingProvider]` - Embedding provider for generating vector embeddings. If `None`, only full-text search will be available.
- `overwrite`: `bool = False` - Whether to overwrite an existing database at the location.
- `name`: `Optional[str] = None` - Internal name for the store.
- `title`: `Optional[str] = None` - Human-readable title for the store.
- `attributes`: `Optional[AttributesSchemaSpec] = None` - Optional schema for user-defined attribute columns stored per chunk. Example: `{"tenant": str, "priority": int}`. Attribute names use identifier-style syntax. The following names are reserved and cannot be used as attributes: `chunk_id`, `context`, `embedding`, `origin`, `text`, `start_index`, `end_index`, `char_count`, `metric_name`, and `metric_value`.

### Returns

- `DuckDBStore` - A newly created store instance.
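The identifier-syntax and reserved-name rules above can be checked before calling `create()`. A minimal standalone sketch; the `validate_attributes` helper is hypothetical (not part of the library), and `RESERVED` mirrors the list in this section:

```python
import re

# Column names reserved by the store, per the list above.
RESERVED = {
    "chunk_id", "context", "embedding", "origin", "text",
    "start_index", "end_index", "char_count", "metric_name", "metric_value",
}

# Identifier-style: a letter or underscore, then letters, digits, or underscores.
IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_attributes(schema: dict) -> None:
    """Hypothetical helper: raise ValueError for attribute names the store would reject."""
    for name in schema:
        if not IDENT.match(name):
            raise ValueError(f"not identifier-style: {name!r}")
        if name in RESERVED:
            raise ValueError(f"reserved attribute name: {name!r}")

validate_attributes({"tenant": str, "priority": int})  # passes silently
```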
## retrieve()

Retrieve the most similar chunks to the given text.

### Usage

```python
retrieve(text, top_k=3, *, deoverlap=True, attributes_filter=None)
```

Combines results from vector similarity search (if embeddings are available) and BM25 full-text search, then optionally merges overlapping chunks.

### Parameters

- `text`: `str` - The query text to search for.
- `top_k`: `int = 3` - The maximum number of chunks to return from each retrieval method (VSS and BM25). Because results from both methods are combined before deoverlapping, the final count may differ from `top_k`.
- `deoverlap`: `bool = True` - If `True` (default), merge overlapping chunks from the same document. Overlapping chunks are identified by their `start_index` and `end_index` positions. When merged, the resulting chunk spans the union of the original ranges, combines metrics, and aggregates attribute values into per-chunk lists in start order. The `context` value is kept from the first chunk in each merged overlap group.
- `attributes_filter`: `Optional[AttributeFilter] = None` - Optional filter to scope retrieval using attribute columns. Can be a SQL-like string or a dict AST. Example string: `"tenant = 'docs' AND priority >= 2"`. Supports declared attributes plus built-in columns: `chunk_id`, `origin`, `start_index`, `end_index`, `char_count`, and `context`.

### Returns

- `Sequence[RetrievedDuckDBMarkdownChunk]` - The retrieved chunks with their relevance metrics.
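The deoverlap step described above, merging chunks whose `start_index`/`end_index` ranges overlap into a span covering their union, can be illustrated with a standalone sketch. This is not the store's actual implementation, just the range-merging idea:

```python
def merge_overlaps(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping (start_index, end_index) ranges into union spans."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:  # overlaps the previous group
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merge_overlaps([(0, 120), (100, 250), (300, 400)])
# → [(0, 250), (300, 400)]: the first two chunks overlap and are merged
```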
## retrieve_bm25()

Retrieve chunks using BM25 full-text search.

### Usage

```python
retrieve_bm25(
    query, top_k, *, k=1.2, b=0.75, conjunctive=False, attributes_filter=None
)
```

Uses DuckDB's fts (Full-Text Search) extension for BM25 ranking. See https://duckdb.org/docs/extensions/full_text_search.html for more details.

### Parameters

- `query`: `str` - The search query text.
- `top_k`: `int` - The maximum number of chunks to return.
- `k`: `float = 1.2` - BM25 term-frequency saturation parameter. Higher values increase the impact of term frequency.
- `b`: `float = 0.75` - BM25 length-normalization parameter (0-1). Higher values penalize longer documents more.
- `conjunctive`: `bool = False` - If `True`, all query terms must be present (AND). If `False` (default), any query term can match (OR).
- `attributes_filter`: `Optional[AttributeFilter] = None` - Optional attribute filter as a SQL-like string or dict AST. Supports declared attributes plus built-in columns: `chunk_id`, `origin`, `start_index`, `end_index`, `char_count`, and `context`.

### Returns

- `list[RetrievedDuckDBMarkdownChunk]` - The matching chunks ranked by BM25 score.
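The roles of `k` and `b` can be seen in the standard Okapi BM25 per-term weight. The sketch below is a plain-Python illustration of that formula, not DuckDB's implementation; the function name and argument names are for illustration only:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_len, k=1.2, b=0.75):
    """Standard BM25 per-term weight: idf times a saturated term frequency."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = 1 - b + b * doc_len / avg_len  # length normalization, controlled by b
    return idf * tf * (k + 1) / (tf + k * norm)

# With b > 0, the same term frequency scores lower in a longer document:
s_short = bm25_term_score(tf=2, df=10, n_docs=1000, doc_len=50, avg_len=100)
s_long = bm25_term_score(tf=2, df=10, n_docs=1000, doc_len=200, avg_len=100)
# s_short > s_long; with b=0 the two would be equal
```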
## retrieve_vss()

Retrieve chunks using vector similarity search.

### Usage

```python
retrieve_vss(
    query, top_k, *, method=VSSMethod.COSINE_DISTANCE, attributes_filter=None
)
```

Uses DuckDB's vss extension for vector similarity search. See https://duckdb.org/docs/extensions/vss.html for more details.

### Parameters

- `query`: `str | Sequence[float]` - The query text or embedding vector. If a string is provided, it will be embedded using the store's embedding provider.
- `top_k`: `int` - The maximum number of chunks to return.
- `method`: `VSSMethod = VSSMethod.COSINE_DISTANCE` - The similarity method to use. Options are:
  - `COSINE_DISTANCE`: Cosine distance (default)
  - `EUCLIDEAN_DISTANCE`: L2/Euclidean distance
  - `NEGATIVE_INNER_PRODUCT`: Negative dot product
- `attributes_filter`: `Optional[AttributeFilter] = None` - Optional attribute filter as a SQL-like string or dict AST. Supports declared attributes plus built-in columns: `chunk_id`, `origin`, `start_index`, `end_index`, `char_count`, and `context`.

### Returns

- `list[RetrievedDuckDBMarkdownChunk]` - The most similar chunks with similarity metrics.

### Raises

- `ValueError` - If `query` is a string but no embedding provider is configured.
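The three `VSSMethod` options correspond to standard vector distance measures. A plain-Python sketch of each, independent of the `vss` extension; for all three, smaller values mean more similar vectors:

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Negated dot product; larger dot product means smaller distance."""
    return -sum(x * y for x, y in zip(a, b))

cosine_distance([1.0, 0.0], [1.0, 0.0])  # → 0.0 for identical directions
euclidean_distance([0.0, 0.0], [3.0, 4.0])  # → 5.0
```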
## upsert()

Upsert a document into the store.

### Usage

```python
upsert(document, *, skip_if_unchanged=True)
```

The document must be a `raghilda.document.ChunkedMarkdownDocument`. Use `raghilda.chunker.MarkdownChunker` to chunk a `raghilda.document.MarkdownDocument` before upserting.

### Parameters

- `document`: `Document` - The chunked document to upsert.
- `skip_if_unchanged`: `bool = True` - If `True` (default), skip the write when the existing document for the same origin already has identical content and chunk layout. This avoids re-computing embeddings.
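One common way to detect "identical content and chunk layout" is to hash the chunk spans together with their text and compare per origin. The sketch below illustrates that idea only; it is hypothetical, not the store's actual mechanism, and `content_key` is not a library function:

```python
import hashlib

def content_key(chunks: list[tuple[int, int, str]]) -> str:
    """Hash each chunk's (start, end) layout plus its text, so any change is detected."""
    h = hashlib.sha256()
    for start, end, text in chunks:
        h.update(f"{start}:{end}:".encode())
        h.update(text.encode())
    return h.hexdigest()

old = content_key([(0, 5, "Hello"), (6, 11, "world")])
new = content_key([(0, 5, "Hello"), (6, 11, "world")])
changed = content_key([(0, 5, "Hello"), (6, 12, "world!")])
# old == new: skip the write and avoid re-embedding
# old != changed: content or layout changed, so upsert and re-embed
```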