----------------------------------------------------------------------
This is the API documentation for the gdtest_long_names library.
----------------------------------------------------------------------

## Document Stores

Backend storage systems for documents and embeddings.

BaseDocumentStore(connection_string: str)
    Abstract base class for document stores.

    Parameters
    ----------
    connection_string
        Database connection string.

DuckDBDocumentStore(connection_string: str, index_type: str = 'hnsw')
    DuckDB-backed document store with vector search.

    Parameters
    ----------
    connection_string
        Database connection string.
    index_type
        Type of vector index to use.

PostgreSQLDocumentStore(connection_string: str, embedding_dimension: int = 1536)
    PostgreSQL-backed document store with pgvector.

    Parameters
    ----------
    connection_string
        Database connection string.
    embedding_dimension
        Dimensionality of embedding vectors.

## DuckDBDocumentStore Methods

Methods for the DuckDBDocumentStore class

upsert_documents(self, docs: list) -> int
    Insert or update documents in the store.

ingest_from_directory(self, path: str) -> int
    Ingest all documents from a directory.

retrieve_by_similarity(self, query: str, top_k: int = 10) -> list
    Retrieve documents by vector similarity search.

retrieve_by_bm25_score(self, query: str, top_k: int = 10) -> list
    Retrieve documents using BM25 text scoring.

retrieve_hybrid_combination(self, query: str, top_k: int = 10) -> list
    Retrieve using hybrid vector + BM25 combination.

build_vector_index(self) -> None
    Build or rebuild the vector similarity index.

get_collection_size(self) -> int
    Return the number of documents in the store.

## Embedding Providers

Services for generating vector embeddings.

EmbeddingProvider(model_name: str)
    Base class for embedding providers.

    Parameters
    ----------
    model_name
        Name of the embedding model.

OpenAIEmbeddingProvider(model_name: str = 'text-embedding-3-small', api_key: str = '')
    OpenAI embedding provider using text-embedding models.

    Parameters
    ----------
    model_name
        Name of the OpenAI model.
    api_key
        OpenAI API key.

CohereEmbeddingProvider(model_name: str = 'embed-english-v3.0', input_type: str = 'search_document')
    Cohere embedding provider with input type support.

    Parameters
    ----------
    model_name
        Name of the Cohere model.
    input_type
        Type of input for embedding.

## Chunker Strategies

Strategies for splitting documents into chunks.

BaseChunkerStrategy(max_chunk_size: int = 1000, overlap_size: int = 200)
    Abstract base class for document chunking strategies.

    Parameters
    ----------
    max_chunk_size
        Maximum size of each chunk in characters.
    overlap_size
        Number of overlapping characters between chunks.

MarkdownChunkerStrategy(max_chunk_size: int = 1000, overlap_size: int = 200, preserve_code_blocks: bool = True)
    Markdown-aware chunking strategy that respects heading boundaries.

    Parameters
    ----------
    max_chunk_size
        Maximum size of each chunk in characters.
    overlap_size
        Number of overlapping characters between chunks.
    preserve_code_blocks
        Whether to keep code blocks intact.

## Data Types

Type definitions and result containers.

RetrievedDocumentChunk(content: str, similarity_score: float, document_id: str) -> None
    A document chunk returned from a retrieval query.

    Parameters
    ----------
    content
        The text content of the chunk.
    similarity_score
        Cosine similarity score (0 to 1).
    document_id
        Identifier of the source document.

DocumentMetadataConfig(extract_title: bool = True, extract_author: bool = True, custom_metadata_fields: list = None) -> None
    Configuration for document metadata extraction.

    Parameters
    ----------
    extract_title
        Whether to extract document titles.
    extract_author
        Whether to extract author information.
    custom_metadata_fields
        Additional metadata fields to extract.

EmbeddingVectorResult(vectors: list, model_name: str, token_count: int) -> None
    Result container for embedding vector operations.

    Parameters
    ----------
    vectors
        List of embedding vectors.
    model_name
        Name of the model used.
    token_count
        Total tokens processed.

## Plain Text Names

Classes with long names containing no special characters.

documentstorewithvectorsearchcapabilities(connectionstring: str, vectordimension: int = 1536)
    A store for documents supporting vector search.

    This class name is entirely lowercase with no separators,
    underscores, dots, or camelCase transitions.

    Parameters
    ----------
    connectionstring
        Database connection string.
    vectordimension
        Dimensionality of stored vectors.

EMBEDDINGPROVIDERWITHBATCHPROCESSINGSUPPORT(MODELIDENTIFIER: str, BATCHLIMIT: int = 100)
    All-uppercase embedding provider class.

    This class name is entirely uppercase with no separators,
    underscores, dots, or camelCase transitions.

    Parameters
    ----------
    MODELIDENTIFIER
        Identifier for the embedding model.
    BATCHLIMIT
        Maximum batch size for processing.

Chunkerstrategywithoverlapdetection(maxchunksize: int = 1000, overlapsize: int = 200)
    Initial-cap chunker strategy class.

    This class name starts with an uppercase letter and the rest is
    entirely lowercase, with no other separators.

    Parameters
    ----------
    maxchunksize
        Maximum size of each chunk in characters.
    overlapsize
        Number of overlapping characters between chunks.

## documentstorewithvectorsearchcapabilities Methods

Methods for the documentstorewithvectorsearchcapabilities class

insertdocumentswithembeddings(self, docs: list) -> int
    Insert documents along with their embedding vectors.

searchbyvectorsimilarity(self, query: str, topk: int = 10) -> list
    Search for documents by vector similarity.

rebuildvectorsearchindex(self) -> None
    Rebuild the internal vector search index.

deletedocumentsbyidentifier(self, docid: str) -> bool
    Delete a document by its unique identifier.

countdocumentsincollection(self) -> int
    Return the total number of documents stored.

exportcollectiontojsonlines(self, filepath: str) -> int
    Export all documents to a JSON Lines file.

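
To make the idea behind `searchbyvectorsimilarity` concrete, here is a minimal sketch of a top-k search ranked by cosine similarity. It does not use gdtest_long_names and is not the library's implementation: the helpers `cosine_similarity` and `top_k_by_similarity`, the in-memory `stored` dictionary, and the choice of cosine similarity as the metric are all assumptions made for illustration. The documented method takes a query string, so the library presumably embeds the query first; this sketch starts from vectors directly.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(query_vector, stored_vectors, topk=10):
    """Return up to `topk` (docid, score) pairs, highest score first.

    Hypothetical helper: `stored_vectors` maps a document id to its vector.
    """
    scored = [
        (docid, cosine_similarity(query_vector, vector))
        for docid, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topk]


# Toy two-dimensional "embeddings", invented for this example.
stored = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [1.0, 1.0],
}

results = top_k_by_similarity([1.0, 0.0], stored, topk=2)
for docid, score in results:
    print(docid, round(score, 4))
```

A brute-force scan like this compares the query against every stored vector. A method such as `rebuildvectorsearchindex` suggests the real store keeps an index to avoid that, but the source does not describe how.
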
## EMBEDDINGPROVIDERWITHBATCHPROCESSINGSUPPORT Methods

Methods for the EMBEDDINGPROVIDERWITHBATCHPROCESSINGSUPPORT class

GENERATEEMBEDDINGSFROMTEXTINPUT(self, texts: list) -> list
    Generate embeddings from a list of text inputs.

CALCULATETOKENCOUNTFORTEXTS(self, texts: list) -> int
    Calculate total token count for the given texts.

RETRIEVEMODELCONFIGURATION(self) -> dict
    Retrieve the current model configuration.

VALIDATEINPUTTEXTLENGTHS(self, texts: list) -> bool
    Validate that all input texts are within length limits.

EXPORTEMBEDDINGSTOFILE(self, filepath: str) -> int
    Export computed embeddings to a file.

RESETINTERNALBATCHCOUNTER(self) -> None
    Reset the internal batch processing counter.

## Chunkerstrategywithoverlapdetection Methods

Methods for the Chunkerstrategywithoverlapdetection class

splitcontentintochunks(self, content: str) -> list
    Split document content into overlapping chunks.

detectoverlapboundaries(self, content: str) -> list
    Detect optimal overlap boundary positions.

mergeundersizedfragments(self, chunks: list) -> list
    Merge fragments that are too small to stand alone.

calculateoverlappercentage(self, chunks: list) -> float
    Calculate the average overlap percentage between chunks.

exportchunkswithoverlap(self, filepath: str) -> int
    Export chunks with overlap markers to a file.

resetinternalchunkcache(self) -> None
    Reset the internal chunk processing cache.
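
The chunker parameters (a maximum chunk size in characters and a number of overlapping characters between chunks) can be illustrated with a small standalone sketch. The function `split_with_overlap` below is hypothetical and is not the library's `splitcontentintochunks`. It assumes a simple fixed-step sliding window, whereas the documented class may pick boundaries differently (for example through `detectoverlapboundaries`).

```python
def split_with_overlap(content, max_chunk_size=1000, overlap_size=200):
    """Split `content` into chunks of at most `max_chunk_size` characters.

    Each chunk after the first repeats the last `overlap_size` characters
    of the previous chunk. Hypothetical helper, written for illustration.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap_size < max_chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than max_chunk_size")

    step = max_chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(content):
        chunks.append(content[start:start + max_chunk_size])
        if start + max_chunk_size >= len(content):
            break  # this chunk already reaches the end of the content
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", max_chunk_size=4, overlap_size=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Requiring `overlap_size` to be smaller than `max_chunk_size` keeps the window moving forward. With the documented defaults of 1000 and 200, each step would advance 800 characters under this scheme.
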