Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

The effectiveness of Retrieval-Augmented Generation (RAG) depends heavily on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, which makes it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on five novel intrinsic, document-based metrics that directly assess chunking quality across key dimensions: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC). To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework raises answer correctness to 72% (from 62-64%) and increases the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code is available at https://github.com/ekimetrics/adaptive-chunking.
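
The per-document selection idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the names `select_chunker`, `size_compliance_proxy`, and `boundary_proxy`, the two toy candidate chunkers, the character thresholds, and the choice to aggregate metric scores by an unweighted mean. The two scoring functions are simple stand-ins, not the paper's definitions of SC, BI, or any other metric.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases: a chunker maps a document to chunks,
# a metric maps (document, chunks) to a score in [0, 1].
Chunker = Callable[[str], List[str]]
Metric = Callable[[str, List[str]], float]


def fixed_size_chunker(size: int) -> Chunker:
    """Toy candidate: split every `size` characters, ignoring structure."""
    def chunk(text: str) -> List[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]
    return chunk


def paragraph_chunker(text: str) -> List[str]:
    """Toy candidate: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def size_compliance_proxy(min_chars: int, max_chars: int) -> Metric:
    """Assumed stand-in: share of chunks whose length is within bounds."""
    def score(document: str, chunks: List[str]) -> float:
        if not chunks:
            return 0.0
        ok = sum(min_chars <= len(c) <= max_chars for c in chunks)
        return ok / len(chunks)
    return score


def boundary_proxy(document: str, chunks: List[str]) -> float:
    """Assumed stand-in: share of chunks ending at sentence punctuation."""
    if not chunks:
        return 0.0
    ok = sum(c.rstrip().endswith((".", "!", "?")) for c in chunks)
    return ok / len(chunks)


def select_chunker(
    document: str,
    chunkers: Dict[str, Chunker],
    metrics: Dict[str, Metric],
) -> Tuple[str, float]:
    """Score each candidate chunking and return the best (name, score).

    Aggregation by unweighted mean is an assumption of this sketch.
    """
    best_name, best_score = "", float("-inf")
    for name, chunker in chunkers.items():
        chunks = chunker(document)
        score = sum(m(document, chunks) for m in metrics.values()) / len(metrics)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


document = (
    "Alpha section explains the first rule in detail.\n\n"
    "Beta section explains the second rule in detail."
)
chunkers = {"fixed_30": fixed_size_chunker(30), "paragraph": paragraph_chunker}
metrics = {"size": size_compliance_proxy(20, 60), "boundary": boundary_proxy}

best_name, best_score = select_chunker(document, chunkers, metrics)
print(best_name, best_score)
```

In this toy setup the selection runs once per document, so a corpus with mixed structure could end up using a different chunker for each file. A fuller version would presumably swap the proxies for the five metrics named above and the toy chunkers for real splitters.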
