Getting Started

ChromaDB is an open-source vector database designed for AI applications. raghilda’s ChromaDBStore provides a convenient interface for storing and retrieving document chunks using ChromaDB as the backend.

Installation

ChromaDB is an optional dependency. Install it with:

pip install chromadb

Or install raghilda with ChromaDB support:

pip install "raghilda[chromadb]"

Creating a Store

Create a new ChromaDB store with ChromaDBStore.create():

from raghilda.store import ChromaDBStore
from raghilda.embedding import EmbeddingOpenAI

# Create a persistent store
store = ChromaDBStore.create(
    location="my_vector_store",
    name="documents",
    embed=EmbeddingOpenAI(),
)

Parameters

Parameter   Description
---------   -----------
location    Path for persistent storage. Use ":memory:" or None for in-memory storage.
name        Collection name within the store. Defaults to "raghilda_chroma".
title       Human-readable title for the store.
embed       Embedding function (raghilda provider or ChromaDB function).
overwrite   If True, delete any existing collection with the same name.
client      Optional pre-configured ChromaDB client (e.g., HttpClient).

In-Memory vs Persistent Storage

# In-memory store (data lost when process ends)
store = ChromaDBStore.create(location=":memory:", embed=EmbeddingOpenAI())

# Persistent store (data saved to disk)
store = ChromaDBStore.create(location="./my_store", embed=EmbeddingOpenAI())

Inserting Documents

Insert chunked documents into the store:

from raghilda.document import MarkdownDocument
from raghilda.chunker import MarkdownChunker

# Create and chunk a document
doc = MarkdownDocument(
    origin="example.md",
    content="# Hello World\n\nThis is a sample document with some content."
)

chunker = MarkdownChunker()
chunked_doc = chunker.chunk(doc)

# Insert into store
store.upsert(chunked_doc)

Multiple Documents

For multiple documents, read and chunk each item before calling upsert():

from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

files = [
    "docs/guide.md",
    "docs/reference.md",
    "docs/tutorial.md",
]
chunker = MarkdownChunker(chunk_size=500)

for uri in files:
    doc = read_as_markdown(uri)
    store.upsert(chunker.chunk(doc))

Retrieving Documents

Search for relevant chunks using semantic similarity:

# Find the 5 most relevant chunks
results = store.retrieve("How do I get started?", top_k=5)

for chunk in results:
    print(f"Score: {chunk.metrics[0].value:.4f}")
    print(f"Text: {chunk.text[:100]}...")
    print()

Deoverlapping Results

By default, overlapping chunks from the same document are merged:

# Merged overlapping chunks (default)
results = store.retrieve("query", top_k=5, deoverlap=True)

# Keep chunks separate
results = store.retrieve("query", top_k=5, deoverlap=False)

When chunks are merged, metric values are preserved, user attributes are aggregated into per-chunk lists ordered by start position, and the merged chunk keeps the context of the first overlapping chunk.
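To make the merge semantics concrete, here is a minimal pure-Python sketch of deoverlapping. This is not raghilda's implementation; the Chunk fields (start, end, text, score, attrs) and the max-score rule are illustrative assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    origin: str
    start: int            # start offset in the source document
    end: int              # end offset (exclusive)
    text: str
    score: float          # similarity metric
    attrs: dict = field(default_factory=dict)


def deoverlap(chunks):
    """Merge chunks whose [start, end) ranges overlap, per origin."""
    merged = []
    # Sort by origin, then start offset, so overlapping chunks are adjacent.
    for c in sorted(chunks, key=lambda c: (c.origin, c.start)):
        last = merged[-1] if merged else None
        if last and last.origin == c.origin and c.start < last.end:
            # Append only the non-overlapping tail of the new chunk's text.
            last.text += c.text[last.end - c.start:]
            last.end = max(last.end, c.end)
            # Keep the best metric; aggregate attributes in start order.
            last.score = max(last.score, c.score)
            for k, v in c.attrs.items():
                last.attrs.setdefault(k, []).append(v)
        else:
            merged.append(Chunk(c.origin, c.start, c.end, c.text,
                                c.score, {k: [v] for k, v in c.attrs.items()}))
    return merged
```

Two chunks spanning offsets 0–10 and 5–15 of the same document would collapse into one chunk covering 0–15, with their attribute values collected into a list.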

Attribute Filtering

Use attributes_filter to narrow retrieval results by declared attributes and built-in filterable columns (for example origin):

# Filter by document origin
results = store.retrieve(
    "query",
    top_k=5,
    attributes_filter="origin = 'guide.md'"
)

Chroma built-in filterable columns are: chunk_id, start_index, end_index, char_count, context, and origin.

For advanced Chroma-specific filters, you can still pass where=... directly.
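For intuition, a filter string like "origin = 'guide.md'" corresponds to a ChromaDB where-clause using the $eq operator. The helper below is a hypothetical sketch, not raghilda's parser, and handles only this single equality pattern:

```python
import re


def filter_to_where(expr: str) -> dict:
    """Translate a simple "column = 'value'" filter string into a
    ChromaDB-style where-clause, e.g. {"origin": {"$eq": "guide.md"}}."""
    m = re.fullmatch(r"\s*(\w+)\s*=\s*'([^']*)'\s*", expr)
    if not m:
        raise ValueError(f"unsupported filter expression: {expr!r}")
    column, value = m.groups()
    return {column: {"$eq": value}}
```

The resulting dict is the shape ChromaDB accepts for its where parameter, which is why the where=... escape hatch remains available for anything the filter string cannot express.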

Connecting to Existing Stores

Reconnect to a previously created store:

# Connect to existing store
store = ChromaDBStore.connect(
    name="documents",
    location="my_vector_store",
)

# Check how many documents are stored
print(f"Documents in store: {store.size()}")

Note

When using ChromaDB’s built-in embedding functions or raghilda’s EmbeddingOpenAI/EmbeddingCohere, the embedding function is automatically restored from the stored configuration. See Embedding Functions for details.

Using a Remote ChromaDB Server

Connect to a ChromaDB server running elsewhere:

import chromadb

# Connect to remote ChromaDB server
client = chromadb.HttpClient(host="localhost", port=8000)

# Use the client with raghilda
store = ChromaDBStore.create(
    client=client,
    name="documents",
    embed=EmbeddingOpenAI(),
)

Complete Example

Here’s a complete workflow from document preparation to retrieval:

from raghilda.store import ChromaDBStore
from raghilda.embedding import EmbeddingOpenAI
from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

# 1. Create store
store = ChromaDBStore.create(
    location="knowledge_base",
    name="docs",
    embed=EmbeddingOpenAI(),
    overwrite=True,
)

# 2. Prepare and insert documents
chunker = MarkdownChunker()
for path in [
    "README.md",
    "docs/getting-started.md",
    "docs/api-reference.md",
]:
    store.upsert(chunker.chunk(read_as_markdown(path)))

print(f"Indexed {store.size()} documents")

# 3. Search
results = store.retrieve("How do I install the package?", top_k=3)

for i, chunk in enumerate(results, 1):
    print(f"\n--- Result {i} ---")
    print(f"From: {chunk.origin}")
    print(f"Text: {chunk.text[:200]}...")

Next Steps