Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker toward generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
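The regex-driven extraction step described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's exact MoC implementation: here each chunker-generated pattern is assumed to match the opening snippet of a chunk, and the text between consecutive match positions becomes one chunk (`extract_chunks` and the sample patterns are hypothetical names for illustration).

```python
import re

def extract_chunks(text: str, patterns: list[str]) -> list[str]:
    """Slice `text` into chunks using chunker-generated regexes.

    Assumption: each pattern matches the start of a chunk; consecutive
    match positions define chunk boundaries. Patterns that fail to match
    are simply skipped, so a partially valid pattern list still yields
    a usable segmentation.
    """
    starts = sorted({m.start() for pat in patterns
                     for m in [re.search(pat, text)] if m})
    if not starts or starts[0] != 0:
        starts.insert(0, 0)          # leading text forms its own chunk
    starts.append(len(text))         # final boundary closes the last chunk
    return [text[a:b] for a, b in zip(starts, starts[1:])
            if text[a:b].strip()]

# Example with hypothetical patterns a chunker might emit:
text = "Intro. Alpha begins here. Beta begins here."
chunks = extract_chunks(text, [r"Alpha begins", r"Beta begins"])
```

A design note on this sketch: extracting chunks via regexes keeps the LLM's output short (a list of boundary patterns rather than the full text restated), which is one way to trade generation cost against chunking precision.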