----------------------------------------------------------------------
This is the API documentation for the gdtest_long_names library.
----------------------------------------------------------------------


## Document Stores

Backend storage systems for documents and embeddings.


BaseDocumentStore(connection_string: str)

Abstract base class for document stores.

Parameters
----------
connection_string
    Database connection string.

DuckDBDocumentStore(connection_string: str, index_type: str = 'hnsw')

DuckDB-backed document store with vector search.

Parameters
----------
connection_string
    Database connection string.
index_type
    Type of vector index to use.

PostgreSQLDocumentStore(connection_string: str, embedding_dimension: int = 1536)

PostgreSQL-backed document store with pgvector.

Parameters
----------
connection_string
    Database connection string.
embedding_dimension
    Dimensionality of embedding vectors.


## DuckDBDocumentStore Methods

Methods for the DuckDBDocumentStore class.


upsert_documents(self, docs: list) -> int

Insert or update documents in the store.

ingest_from_directory(self, path: str) -> int

Ingest all documents from a directory.

retrieve_by_similarity(self, query: str, top_k: int = 10) -> list

Retrieve documents by vector similarity search.
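
The ranking step behind a similarity search can be sketched in plain Python. This is only an illustration of cosine-similarity top-k ranking, not the library's implementation: `cosine_similarity`, `top_k_by_similarity`, and the `stored_vectors` mapping are hypothetical names, and the sketch assumes the query has already been embedded into a vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k_by_similarity(query_vector, stored_vectors, top_k=10):
    """Return (document_id, score) pairs, highest score first.

    stored_vectors is a hypothetical {document_id: vector} mapping.
    """
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A real store would normally use an approximate index rather than this exhaustive scan, which compares the query against every stored vector.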

retrieve_by_bm25_score(self, query: str, top_k: int = 10) -> list

Retrieve documents using BM25 text scoring.
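
For orientation, the standard Okapi BM25 formula can be sketched as follows. The function name `bm25_scores`, the whitespace tokenizer, and the `k1` and `b` defaults are assumptions made for illustration; the documentation does not state which BM25 variant or tokenizer the store uses.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Number of documents that contain each term at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            freq = term_freq.get(term, 0)
            if freq == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * freq * (k1 + 1.0) / (freq + k1 * length_norm)
        scores.append(score)
    return scores
```

Sorting document indices by these scores and keeping the first `top_k` would give a ranked result list.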

retrieve_hybrid_combination(self, query: str, top_k: int = 10) -> list

Retrieve documents using a hybrid combination of vector similarity and BM25 scoring.
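
The documentation does not say how the two rankings are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the library's method. The helper name `reciprocal_rank_fusion` and the constant `k=60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, top_k=10, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the best hit.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ordered[:top_k]]
```

RRF works on ranks only, so the vector scores and BM25 scores do not need to be on the same scale.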

build_vector_index(self) -> None

Build or rebuild the vector similarity index.

get_collection_size(self) -> int

Return the number of documents in the store.


## Embedding Providers

Services for generating vector embeddings.


EmbeddingProvider(model_name: str)

Base class for embedding providers.

Parameters
----------
model_name
    Name of the embedding model.

OpenAIEmbeddingProvider(model_name: str = 'text-embedding-3-small', api_key: str = '')

OpenAI embedding provider using text-embedding models.

Parameters
----------
model_name
    Name of the OpenAI model.
api_key
    OpenAI API key.

CohereEmbeddingProvider(model_name: str = 'embed-english-v3.0', input_type: str = 'search_document')

Cohere embedding provider with input type support.

Parameters
----------
model_name
    Name of the Cohere model.
input_type
    Type of input being embedded (for example, 'search_document').


## Chunker Strategies

Strategies for splitting documents into chunks.


BaseChunkerStrategy(max_chunk_size: int = 1000, overlap_size: int = 200)

Abstract base class for document chunking strategies.

Parameters
----------
max_chunk_size
    Maximum size of each chunk in characters.
overlap_size
    Number of overlapping characters between chunks.
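
How `max_chunk_size` and `overlap_size` interact can be shown with a minimal fixed-window splitter. This is a sketch under the assumption that each chunk starts `max_chunk_size - overlap_size` characters after the previous one; `split_with_overlap` is a hypothetical helper, not a method of the class.

```python
def split_with_overlap(text, max_chunk_size=1000, overlap_size=200):
    """Split text into character windows that share overlap_size characters."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap_size < max_chunk_size:
        raise ValueError("overlap_size must be in [0, max_chunk_size)")

    step = max_chunk_size - overlap_size  # distance between chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, a new chunk would begin every 800 characters and repeat the last 200 characters of the previous chunk.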

MarkdownChunkerStrategy(max_chunk_size: int = 1000, overlap_size: int = 200, preserve_code_blocks: bool = True)

Markdown-aware chunking strategy that respects heading boundaries.

Parameters
----------
max_chunk_size
    Maximum size of each chunk in characters.
overlap_size
    Number of overlapping characters between chunks.
preserve_code_blocks
    Whether to keep code blocks intact.
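
A rough sketch of heading-aware splitting that keeps fenced code blocks intact is given below. It is an assumption about what such a strategy might do, not the class's actual algorithm: `split_markdown_sections` is a hypothetical helper, it treats any line starting with `#` outside a fence as a heading, and it does not enforce `max_chunk_size`.

```python
FENCE = "`" * 3  # the Markdown code fence marker


def split_markdown_sections(markdown_text):
    """Split Markdown at heading lines, never inside a fenced code block."""
    sections = []
    current = []
    in_code_block = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code_block = not in_code_block  # entering or leaving a fence
        is_heading = not in_code_block and line.startswith("#")
        if is_heading and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```

Tracking the fence state is what prevents a `#` comment inside a code block from being mistaken for a heading.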


## Data Types

Type definitions and result containers.


RetrievedDocumentChunk(content: str, similarity_score: float, document_id: str) -> None

A document chunk returned from a retrieval query.

Parameters
----------
content
    The text content of the chunk.
similarity_score
    Cosine similarity score (0 to 1).
document_id
    Identifier of the source document.

DocumentMetadataConfig(extract_title: bool = True, extract_author: bool = True, custom_metadata_fields: list = None) -> None

Configuration for document metadata extraction.

Parameters
----------
extract_title
    Whether to extract document titles.
extract_author
    Whether to extract author information.
custom_metadata_fields
    Additional metadata fields to extract.
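
The `None` default for `custom_metadata_fields` is commonly treated as "no extra fields". The stand-in dataclass below sketches that convention. `MetadataConfigSketch`, `fields_to_extract`, and the `"title"` and `"author"` field names are hypothetical and are not part of the documented API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetadataConfigSketch:
    """Stand-in that mirrors the documented constructor parameters."""

    extract_title: bool = True
    extract_author: bool = True
    custom_metadata_fields: Optional[List[str]] = None

    def fields_to_extract(self):
        """List the field names this configuration asks for (hypothetical)."""
        fields = []
        if self.extract_title:
            fields.append("title")
        if self.extract_author:
            fields.append("author")
        # None is normalized to an empty list instead of sharing a mutable default.
        fields.extend(self.custom_metadata_fields or [])
        return fields
```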

EmbeddingVectorResult(vectors: list, model_name: str, token_count: int) -> None

Result container for embedding vector operations.

Parameters
----------
vectors
    List of embedding vectors.
model_name
    Name of the model used.
token_count
    Total tokens processed.


## Plain Text Names

Classes with long names containing no special characters.


documentstorewithvectorsearchcapabilities(connectionstring: str, vectordimension: int = 1536)

A store for documents supporting vector search.

This class name is entirely lowercase with no separators,
underscores, dots, or camelCase transitions.

Parameters
----------
connectionstring
    Database connection string.
vectordimension
    Dimensionality of stored vectors.

EMBEDDINGPROVIDERWITHBATCHPROCESSINGSUPPORT(MODELIDENTIFIER: str, BATCHLIMIT: int = 100)

All-uppercase embedding provider class.

This class name is entirely uppercase with no separators,
underscores, dots, or camelCase transitions.

Parameters
----------
MODELIDENTIFIER
    Identifier for the embedding model.
BATCHLIMIT
    Maximum batch size for processing.
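
A batch limit of this kind usually means the input list is processed in slices of at most that many items. The generator below is a minimal sketch of that idea; `iter_batches` is a hypothetical helper and the documentation does not describe how the class batches internally.

```python
def iter_batches(texts, batch_limit=100):
    """Yield consecutive slices of texts with at most batch_limit items each."""
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")
    for start in range(0, len(texts), batch_limit):
        yield texts[start:start + batch_limit]
```

Only the final batch can be shorter than `batch_limit`.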

Chunkerstrategywithoverlapdetection(maxchunksize: int = 1000, overlapsize: int = 200)

Initial-cap chunker strategy class.

This class name starts with an uppercase letter and the rest
is entirely lowercase, with no other separators.

Parameters
----------
maxchunksize
    Maximum size of each chunk in characters.
overlapsize
    Number of overlapping characters between chunks.


## documentstorewithvectorsearchcapabilities Methods

Methods for the documentstorewithvectorsearchcapabilities class.


insertdocumentswithembeddings(self, docs: list) -> int

Insert documents along with their embedding vectors.

searchbyvectorsimilarity(self, query: str, topk: int = 10) -> list

Search for documents by vector similarity.

rebuildvectorsearchindex(self) -> None

Rebuild the internal vector search index.

deletedocumentsbyidentifier(self, docid: str) -> bool

Delete a document by its unique identifier.

countdocumentsincollection(self) -> int

Return the total number of documents stored.

exportcollectiontojsonlines(self, filepath: str) -> int

Export all documents to a JSON Lines file.
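
JSON Lines stores one JSON value per line. The sketch below shows that layout with the standard library. It assumes documents are plain dictionaries and that the returned integer is the number of documents written; neither is confirmed by the documentation, and `export_to_json_lines` is a hypothetical helper.

```python
import json


def export_to_json_lines(documents, filepath):
    """Write one JSON object per line and return how many were written."""
    count = 0
    with open(filepath, "w", encoding="utf-8") as handle:
        for document in documents:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            handle.write(json.dumps(document, ensure_ascii=False) + "\n")
            count += 1
    return count
```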


## EMBEDDINGPROVIDERWITHBATCHPROCESSINGSUPPORT Methods

Methods for the EMBEDDINGPROVIDERWITHBATCHPROCESSINGSUPPORT class.


GENERATEEMBEDDINGSFROMTEXTINPUT(self, texts: list) -> list

Generate embeddings from a list of text inputs.

CALCULATETOKENCOUNTFORTEXTS(self, texts: list) -> int

Calculate total token count for the given texts.

RETRIEVEMODELCONFIGURATION(self) -> dict

Retrieve the current model configuration.

VALIDATEINPUTTEXTLENGTHS(self, texts: list) -> bool

Validate that all input texts are within length limits.

EXPORTEMBEDDINGSTOFILE(self, filepath: str) -> int

Export computed embeddings to a file.

RESETINTERNALBATCHCOUNTER(self) -> None

Reset the internal batch processing counter.


## Chunkerstrategywithoverlapdetection Methods

Methods for the Chunkerstrategywithoverlapdetection class.


splitcontentintochunks(self, content: str) -> list

Split document content into overlapping chunks.

detectoverlapboundaries(self, content: str) -> list

Detect optimal overlap boundary positions.

mergeundersizedfragments(self, chunks: list) -> list

Merge fragments that are too small to stand alone.
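
One plausible merging rule is to fold any fragment shorter than a threshold into its neighbor. The sketch below assumes a hypothetical `min_size` threshold, which the documented signature does not expose, and `merge_undersized` is an illustrative helper rather than the method itself.

```python
def merge_undersized(chunks, min_size=50):
    """Merge fragments shorter than min_size into an adjacent chunk."""
    merged = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_size:
            merged[-1] += chunk  # previous chunk is too small, so extend it
        else:
            merged.append(chunk)
    # A small trailing fragment has no successor, so attach it to the chunk before it.
    if len(merged) > 1 and len(merged[-1]) < min_size:
        tail = merged.pop()
        merged[-1] += tail
    return merged
```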

calculateoverlappercentage(self, chunks: list) -> float

Calculate the average overlap percentage between chunks.
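
The documentation does not define "overlap percentage". One reasonable reading is the share of each chunk that repeats the end of the previous chunk, averaged over consecutive pairs. The sketch below implements that reading only; both function names are hypothetical.

```python
def shared_boundary_length(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    limit = min(len(left), len(right))
    for size in range(limit, 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def average_overlap_percentage(chunks):
    """Average, over consecutive pairs, of overlap length / next chunk length."""
    ratios = [
        shared_boundary_length(left, right) / len(right)
        for left, right in zip(chunks, chunks[1:])
        if right  # skip empty chunks to avoid dividing by zero
    ]
    if not ratios:
        return 0.0
    return 100.0 * sum(ratios) / len(ratios)
```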

exportchunkswithoverlap(self, filepath: str) -> int

Export chunks with overlap markers to a file.

resetinternalchunkcache(self) -> None

Reset the internal chunk processing cache.