Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial step of text chunking within its pipeline, which impacts the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, a granularity between sentences and paragraphs: a collection of sentences within a paragraph that share deep linguistic and logical connections. To implement Meta-Chunking, we design Perplexity (PPL) Chunking, which balances performance and speed, and precisely identifies the boundaries of text chunks by analyzing the characteristics of the context's perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines PPL Chunking with dynamic merging to strike a balance between fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of single-hop and multi-hop question answering based on RAG. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 points while consuming only 45.8% of the time. Furthermore, through analysis of models of various scales and types, we observe that PPL Chunking exhibits notable flexibility and adaptability. Our code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
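The boundary-detection idea behind PPL Chunking can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes per-sentence perplexities have already been computed with a language model, and the local-minimum rule with a `threshold` margin, the `max_len` budget, and both function names are illustrative assumptions.

```python
def find_boundaries(ppls, threshold=0.5):
    """Hypothetical sketch: place a chunk boundary after sentence i when its
    perplexity is a local minimum, lower than both neighbors by `threshold`.
    A low-perplexity sentence suggests the preceding context fully explains
    it, so the logical unit may end there."""
    boundaries = []
    for i in range(1, len(ppls) - 1):
        if ppls[i] < ppls[i - 1] - threshold and ppls[i] < ppls[i + 1] - threshold:
            boundaries.append(i)
    return boundaries


def merge_chunks(sentences, boundaries, max_len=120):
    """Hypothetical sketch of dynamic merging: greedily combine adjacent
    fine-grained segments while the merged character length stays under
    `max_len`, trading fine granularity for coarser chunks."""
    # Split sentences into segments at the detected boundaries.
    segments, start = [], 0
    for b in boundaries + [len(sentences) - 1]:
        segments.append(sentences[start:b + 1])
        start = b + 1
    # Greedily merge segments under the length budget.
    merged, buf = [], []
    for seg in segments:
        if buf and sum(len(s) for s in buf + seg) > max_len:
            merged.append(" ".join(buf))
            buf = []
        buf.extend(seg)
    if buf:
        merged.append(" ".join(buf))
    return merged
```

In practice the perplexities would come from a causal LM scoring each sentence given its preceding context; the sketch only shows how the resulting distribution could be turned into chunk boundaries and then coarsened by merging.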