Core Concepts

Large language models (LLMs) sometimes generate confident but incorrect information—a phenomenon known as hallucination. This happens because LLMs work by predicting the most likely next words based on patterns learned during training, without any inherent concept of truth or factual accuracy.

Why RAG?

Retrieval-Augmented Generation (RAG) addresses this by grounding LLM responses in trusted source material. Instead of relying solely on the model’s training data, RAG retrieves relevant content from a curated knowledge base and includes it in the prompt. This shifts the model’s role from open-ended generation to summarizing vetted content.

While RAG doesn’t eliminate hallucinations entirely, it significantly reduces them for domain-specific applications by ensuring responses are anchored in authoritative sources.

Building a RAG System

A RAG system has two main phases:

  1. Preparation: Building a searchable knowledge store from your documents
  2. Retrieval: Finding relevant content to augment LLM prompts

Let’s walk through building a RAG system using the Quarto documentation as our knowledge base.

Creating a Store

First, create a store with an embedding provider. The store will hold your document chunks and their vector embeddings:

from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI

store = DuckDBStore.create(
    location="quarto_docs.db",
    embed=EmbeddingOpenAI(),
    name="quarto",
    title="Quarto Documentation",
    overwrite=True,
)

raghilda supports multiple embedding providers (OpenAI, Cohere) and storage backends (DuckDB, ChromaDB, OpenAI Vector Stores, PostgreSQL). See the API Reference for all options.
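Whichever provider you choose, the idea is the same: each chunk is mapped to a vector, and retrieval ranks chunks by how close their vectors are to the query's vector. A toy sketch of that comparison using made-up three-dimensional vectors (real embedding providers return hundreds of dimensions, and this is not raghilda's internal code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a query and two chunks.
query = [0.9, 0.1, 0.0]
chunk_about_slides = [0.8, 0.2, 0.1]
chunk_about_tables = [0.1, 0.1, 0.9]

print(cosine_similarity(query, chunk_about_slides))  # high: semantically close
print(cosine_similarity(query, chunk_about_tables))  # low: unrelated topic
```

This is why semantically related text can match even when it shares no keywords with the query.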

Finding Documents

Next, identify the documents to include. The find_links() function can crawl a website to discover pages:

from raghilda.scrape import find_links

links = find_links(
    "https://quarto.org/docs/guide/",
    depth=1,  # follow links 1 level deep from the starting page
    children_only=True,
)
print(f"Found {len(links)} pages")
Found 2 pages

The depth parameter controls how many levels of links to follow, and children_only=True restricts crawling to pages under the starting URL.
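Conceptually, the children_only restriction amounts to a URL prefix filter after resolving relative links. This is only an illustrative sketch of that rule, not find_links()'s actual implementation:

```python
from urllib.parse import urljoin

def is_child(start_url: str, link: str) -> bool:
    """True if `link` (possibly relative) resolves to a page under `start_url`."""
    return urljoin(start_url, link).startswith(start_url)

start = "https://quarto.org/docs/guide/"
print(is_child(start, "authoring/figures.html"))              # True: under the guide
print(is_child(start, "https://quarto.org/docs/reference/"))  # False: a sibling section
```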

You can also work with local files or provide a list of URLs directly:

# Local files
links = ["docs/guide.md", "docs/reference.md", "docs/tutorial.md"]

# Or use glob patterns with pathlib
from pathlib import Path
links = list(Path("docs").glob("**/*.md"))

Preparing Documents

Prepare each document explicitly by reading it, chunking it, and passing the result to upsert():

from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker()

for link in links:
    document = read_as_markdown(link)
    chunked = chunker.chunk(document)
    store.upsert(chunked)

print(f"Indexed {store.size()} documents")
Indexed 2 documents

That is the full preparation phase. Each document is converted to Markdown, split into overlapping chunks, embedded, and written to the store through explicit calls that keep the indexing pipeline visible.

What Happens During Preparation

Each item you index typically goes through two steps before it is stored:

1. Convert to Markdown: read_as_markdown() converts the item (a URL or file path) into a Markdown document. It handles HTML pages, PDFs, DOCX files, and more using MarkItDown. For HTML, it extracts the <main> element and removes <nav> elements by default.

from raghilda.read import read_as_markdown

doc = read_as_markdown("https://quarto.org/docs/guide/")
print(doc.content[:500])
# Guide – Quarto

# Guide

Comprehensive guide to using Quarto. If you are just starting out, you may want to explore the [tutorials](../../docs/get-started/index.html) to learn the basics.

#### Authoring

###### Create content with markdown

* [Markdown Basics](../../docs/authoring/markdown-basics.html)
* [Figures](../../docs/authoring/figures.html)
* [Tables](../../docs/authoring/tables.html)
* [Diagrams](../../docs/authoring/diagrams.html)
* [Citations](../../docs/authoring/citations.html)
*

2. Chunk the document: MarkdownChunker splits the Markdown into overlapping chunks at semantic boundaries (headings, paragraphs, sentences). The defaults are a chunk size of 1600 characters with 50% overlap between chunks. Each chunk retains the heading hierarchy it falls under as context.

from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker(
    chunk_size=1600,      # Target size in characters
    target_overlap=0.5,   # 50% overlap between chunks
)
chunked_doc = chunker.chunk(doc)
print(f"Created {len(chunked_doc.chunks)} chunks")
print(f"\nFirst chunk context: {chunked_doc.chunks[0].context}")
print(f"First chunk text:\n{chunked_doc.chunks[0].text[:200]}...")
Created 6 chunks

First chunk context: None
First chunk text:
# Guide – Quarto

# Guide

Comprehensive guide to using Quarto. If you are just starting out, you may want to explore the [tutorials](../../docs/get-started/index.html) to learn the basics.

#### Auth...

After chunking, upsert() embeds the chunks using the store’s embedding provider and writes them to the database.
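To see what 50% overlap means concretely, here is a bare-bones sliding-window chunker. Unlike MarkdownChunker it ignores headings and semantic boundaries entirely; it only illustrates the size and overlap arithmetic:

```python
def sliding_chunks(text: str, chunk_size: int = 1600, overlap: float = 0.5) -> list[str]:
    """Split text into fixed-size windows; each window starts
    chunk_size * (1 - overlap) characters after the previous one."""
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "abcdefghij" * 100  # 1,000 characters of sample text
chunks = sliding_chunks(text, chunk_size=200, overlap=0.5)
print(len(chunks))                         # 9 windows of 200 characters each
print(chunks[0][100:] == chunks[1][:100])  # True: adjacent chunks share half
```

The overlap is what lets a passage that straddles a chunk boundary still appear whole in at least one chunk.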

Customizing Preparation

You can wrap your preferred reading and chunking logic in a helper that returns a chunked Document:

from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker(chunk_size=800, target_overlap=0.3)

def prepare(uri):
    doc = read_as_markdown(uri)
    return chunker.chunk(doc)

for link in links:
    store.upsert(prepare(link))

Common reasons to customize the prepare() helper:

  • Adjust chunk size or overlap — Smaller chunks for more precise retrieval, larger for more context.
  • Set hard heading boundaries — Use segment_by_heading_levels=[1, 2] to prevent chunks from crossing major sections.
  • Control HTML extraction — Pass html_extract_selectors or html_zap_selectors to read_as_markdown().
  • Use a different chunker — Any chunker that returns a chunked document will work. See the Chunking guide for more options, including chonkie integration.

Building Indexes

After ingestion, build indexes to speed up retrieval:

store.build_index()

This creates both a vector similarity index (HNSW) for semantic search and a BM25 index for keyword search.
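BM25 is a classic keyword-ranking function. A compact version of its scoring formula, with the conventional parameters k1 = 1.5 and b = 0.75, is shown below on a toy corpus; this is independent of raghilda's actual BM25 index:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the BM25 ranking function."""
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)  # average doc length
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores

docs = [
    "quarto presentations use revealjs",
    "tables and figures in quarto documents",
    "citations and bibliographies",
]
scores = bm25_scores("quarto presentations", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0: the presentations doc wins
```

Rare terms get a higher idf weight, and term frequency saturates, so a document can't win just by repeating one word.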

Retrieving Content

Now you can search your knowledge base:

chunks = store.retrieve("How do I create a Quarto presentation?", top_k=5)

for chunk in chunks:
    print(f"Score: {chunk.metrics[0].value:.4f}")
    print(chunk.text[:200])
    print("---")
Score: 0.4994
# Guide – Quarto

# Guide

Comprehensive guide to using Quarto. If you are just starting out, you may want to explore the [tutorials](../../docs/get-started/index.html) to learn the basics.

#### Auth
---
Score: 0.4994
# Guide – Quarto

# Guide

Comprehensive guide to using Quarto. If you are just starting out, you may want to explore the [tutorials](../../docs/get-started/index.html) to learn the basics.

#### Auth
---

The retrieve() method combines vector similarity search (semantic matching) with BM25 (keyword matching) for hybrid retrieval. By default, overlapping chunks from the same document are merged (deoverlap=True) to produce more coherent results: the merged chunk concatenates the metrics of its constituents, aggregates their attribute values into lists ordered by chunk start position, and keeps the context of the first overlapping chunk.
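The merging step can be pictured as collapsing overlapping character intervals within a document. A simplified sketch, under the assumption that each chunk carries its (start, end) offsets and text (this is not raghilda's actual deoverlap code):

```python
def merge_overlapping(chunks: list[dict]) -> list[dict]:
    """Merge chunks from one document whose [start, end) spans overlap.
    Each chunk is a dict with 'start', 'end', and 'text' keys."""
    ordered = sorted(chunks, key=lambda c: c["start"])
    merged = [dict(ordered[0])]
    for chunk in ordered[1:]:
        last = merged[-1]
        if chunk["start"] < last["end"]:
            # Append only the non-overlapping tail of this chunk's text.
            overlap = last["end"] - chunk["start"]
            last["text"] += chunk["text"][overlap:]
            last["end"] = max(last["end"], chunk["end"])
        else:
            merged.append(dict(chunk))
    return merged

chunks = [
    {"start": 0, "end": 8, "text": "Quarto s"},
    {"start": 4, "end": 13, "text": "to slides"},  # overlaps the first chunk
]
print(merge_overlapping(chunks)[0]["text"])  # Quarto slides
```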

You can also use the individual search methods:

# Vector similarity search only
chunks = store.retrieve_vss("presentations", top_k=5)

# BM25 keyword search only
chunks = store.retrieve_bm25("presentations", top_k=5)
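If you want to combine the two result lists yourself, reciprocal rank fusion (RRF) is one common, score-free way to do it; this sketch illustrates the standard technique, not raghilda's internal fusion logic:

```python
def rrf_fuse(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids: each id scores sum(1 / (k + rank))
    across the lists it appears in, so agreement between rankers wins."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vss_ids = ["a", "b", "c"]   # e.g. ids of chunks from retrieve_vss()
bm25_ids = ["b", "d", "a"]  # e.g. ids of chunks from retrieve_bm25()
print(rrf_fuse(vss_ids, bm25_ids))  # ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it sidesteps the problem that vector distances and BM25 scores live on incompatible scales.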

Using with an LLM

The retrieved chunks can augment your LLM prompts. Here’s an example using chatlas:

from chatlas import ChatOpenAI

# Connect to existing store
store = DuckDBStore.connect("quarto_docs.db", read_only=True)

# Define a search tool
def search_docs(query: str) -> str:
    """Search the Quarto documentation for relevant information."""
    import json
    chunks = store.retrieve(query, top_k=5, deoverlap=True)
    return json.dumps([{"text": chunk.text, "context": chunk.context} for chunk in chunks])

# Create chat with RAG tool
chat = ChatOpenAI(
    model="gpt-4o-mini",
    system_prompt="""Answer questions about Quarto using the search tool.
Always search the documentation before answering.""",
)
chat.register_tool(search_docs)

# Ask a question
chat.chat("How do I add citations to a Quarto document?")
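If you prefer not to use tool calling, you can also stuff the retrieved chunks into the prompt yourself. A minimal sketch of that pattern; the build_prompt helper is hypothetical, not part of raghilda or chatlas:

```python
def build_prompt(question: str, chunk_texts: list[str]) -> str:
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n\n---\n\n".join(chunk_texts)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# With a live store: texts = [c.text for c in store.retrieve(question, top_k=5)]
texts = ["Citations are added with the @key syntax...", "Use a .bib file..."]
prompt = build_prompt("How do I add citations?", texts)
print(prompt[:40])
```

The "only the context below" instruction is what turns the model's job into summarizing vetted content rather than open-ended generation.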

Reconnecting to a Store

To reuse an existing store:

store = DuckDBStore.connect("quarto_docs.db")
print(f"Store contains {store.size()} chunks")

The embedding configuration is automatically restored, so you can immediately start retrieving.

Complete Example

Here’s the full workflow in one script:

from raghilda.store import DuckDBStore
from raghilda.embedding import EmbeddingOpenAI
from raghilda.scrape import find_links

# 1. Create store
store = DuckDBStore.create(
    location="quarto_docs.db",
    embed=EmbeddingOpenAI(),
    name="quarto",
    title="Quarto Documentation",
    overwrite=True,
)

# 2. Find documents
links = find_links(
    "https://quarto.org/docs/guide/",
    depth=1,
    children_only=True,
)

# 3. Prepare and insert documents
from raghilda.read import read_as_markdown
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker()
for link in links:
    store.upsert(chunker.chunk(read_as_markdown(link)))

# 4. Build indexes
store.build_index()

# 5. Retrieve
chunks = store.retrieve("How do I create a presentation?", top_k=3)
for chunk in chunks:
    print(f"\n## {chunk.context}")
    print(chunk.text[:300])

Next Steps