Chunk Markdown documents into overlapping segments at semantic boundaries.
chunker.MarkdownChunker(chunk_size=1600, target_overlap=0.5, max_snap_distance=20, segment_by_heading_levels=None)
This chunker divides Markdown text into smaller, overlapping chunks while intelligently positioning cut points at semantic boundaries like headings, paragraphs, sentences, and words. Rather than cutting rigidly at character counts, it nudges cut points to the nearest sensible boundary, producing more semantically coherent chunks suitable for RAG applications.
Parameters
chunk_size: int = 1600
    Target size for each chunk in characters. The chunker attempts to create chunks near this size, though actual sizes may vary based on semantic boundaries. Default is 1600 characters.
target_overlap: float = 0.5
    Fraction of overlap between successive chunks, from 0 to 1. Default is 0.5 (50% overlap). Even at 0, some overlap may occur because the last chunk is anchored to the document end.
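The library's internals aren't shown here, but a minimal sketch of how chunk_size and target_overlap could combine into a stride between successive chunk starts (the helper name and the stride formula are assumptions, not the library's API):

```python
def chunk_spans(text_len: int, chunk_size: int = 1600, target_overlap: float = 0.5):
    """Yield (start, end) character spans; illustrative sketch only."""
    # Distance between successive chunk starts: a 0.5 overlap halves the stride.
    stride = max(1, int(chunk_size * (1 - target_overlap)))
    spans = []
    start = 0
    while start + chunk_size < text_len:
        spans.append((start, start + chunk_size))
        start += stride
    # The last chunk is anchored to the document end, which is why some
    # overlap can appear even with target_overlap=0.
    spans.append((max(0, text_len - chunk_size), text_len))
    return spans
```

This also illustrates the end-anchoring behavior noted above: the final span is pinned to `text_len` regardless of the stride.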
max_snap_distance: int = 20
    Maximum distance (in characters) a cut point may be moved to reach a semantic boundary. If no boundary is found within this distance, the cut point stays at its original position. Default is 20.
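The snapping rule described above can be sketched in a few lines (the helper name is illustrative, not part of the library):

```python
def snap(cut: int, boundaries: list[int], max_snap_distance: int = 20) -> int:
    """Move `cut` to the nearest boundary within max_snap_distance, else leave it."""
    in_range = [b for b in boundaries if abs(b - cut) <= max_snap_distance]
    return min(in_range, key=lambda b: abs(b - cut)) if in_range else cut
```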
segment_by_heading_levels: Optional[list[int]] = None
    List of heading levels (1-6) that act as hard boundaries. When specified, no chunk crosses these headings, and the segments between them are chunked independently. For example, [1, 2] ensures chunks never span an h1 or h2 heading.
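One way this hard segmentation could work is a regex split at ATX headings of the requested levels; this is a sketch under that assumption, not the library's implementation:

```python
import re

def split_at_headings(text: str, levels: list[int]) -> list[str]:
    """Split Markdown at ATX headings of the given levels (illustrative sketch)."""
    # Build e.g. r"^(?:#|##) " for levels [1, 2]: a heading line starts the segment.
    pattern = re.compile(
        r"^(?:%s) " % "|".join("#" * lv for lv in sorted(levels)),
        re.MULTILINE,
    )
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # text before the first matching heading is its own segment
    starts.append(len(text))
    return [text[a:b] for a, b in zip(starts, starts[1:]) if text[a:b]]
```

Note that with levels=[1, 2], an h3 heading does not start a new segment; it remains an ordinary (soft) boundary inside its segment.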
Examples
```python
from raghilda.chunker import MarkdownChunker

chunker = MarkdownChunker(
    chunk_size=100,
    target_overlap=0.2,
    segment_by_heading_levels=[1, 2],
)

text = '''# Introduction
This is the introduction section with some content.

## Background
Here is background information that provides context.

## Methods
The methods section describes our approach.
'''

chunks = chunker.chunk_text(text)
for chunk in chunks:
    print(f"[{chunk.start_index}:{chunk.end_index}] {chunk.text[:40]}...")
```

Output:

```
[0:68] # Introduction
This is the introduction ...
[68:137] ## Background
Here is background informa...
[137:192] ## Methods
The methods section describes...
```
Notes
The chunking algorithm works as follows:
- Parse the Markdown to identify semantic boundaries (headings, paragraphs, sentences, lines, words)
- If segment_by_heading_levels is set, split the document at those headings first
- For each segment, calculate target chunk boundaries based on chunk_size and target_overlap
- Snap each boundary to the nearest semantic boundary (preferring headings > paragraphs > sentences > lines > words)
- Extract chunks with their positional information and heading context
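The boundary-preference step above can be sketched as a priority-ordered snap: try the highest-priority boundary kind first, and only fall back to weaker kinds if none is within range (names and the dict shape are assumptions for illustration):

```python
def snap_with_priority(cut: int, boundaries_by_kind: dict, max_snap_distance: int = 20) -> int:
    """Snap `cut` to the nearest boundary, preferring stronger boundary kinds.

    boundaries_by_kind maps a kind name to a list of character positions;
    this is an illustrative sketch, not the library's API.
    """
    for kind in ("heading", "paragraph", "sentence", "line", "word"):
        candidates = [b for b in boundaries_by_kind.get(kind, ())
                      if abs(b - cut) <= max_snap_distance]
        if candidates:
            return min(candidates, key=lambda b: abs(b - cut))
    return cut  # no boundary in range: keep the original cut point
```

Because stronger kinds are checked first, a heading within range wins even when a word boundary is strictly closer to the original cut point.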