Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial step of text chunking within its pipeline, which degrades the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, a granularity between sentences and paragraphs: a collection of sentences within a paragraph that share deep linguistic and logical connections. To implement Meta-Chunking, we design two LLM-based strategies: Margin Sampling Chunking and Perplexity Chunking. The former has an LLM perform binary classification on whether consecutive sentences should be segmented, deciding based on the probability difference obtained from margin sampling. The latter identifies text chunk boundaries precisely by analyzing the characteristics of the perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines Meta-Chunking with dynamic merging to balance fine-grained and coarse-grained chunking. Experiments on eleven datasets demonstrate that Meta-Chunking improves the performance of single-hop and multi-hop question answering based on RAG more efficiently. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 points while consuming only 45.8% of the time. Our code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
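The boundary-detection idea behind Perplexity Chunking can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-sentence perplexities have already been computed by some language model, and it treats a local minimum in the perplexity sequence as a chunk boundary (sentences whose perplexity drops below both neighbors are well predicted from preceding context, suggesting a natural segmentation point). The function names and the `threshold` parameter are hypothetical.

```python
def find_chunk_boundaries(ppls, threshold=0.0):
    """Return indices i where ppls[i] is a local minimum, i.e. lower than
    both neighbors by at least `threshold`. Each such index marks the end
    of a chunk (a sketch of the local-minimum rule, not the paper's code)."""
    boundaries = []
    for i in range(1, len(ppls) - 1):
        if (ppls[i - 1] - ppls[i] >= threshold
                and ppls[i + 1] - ppls[i] >= threshold):
            boundaries.append(i)
    return boundaries


def chunk_sentences(sentences, ppls, threshold=0.0):
    """Group sentences into chunks, cutting after each detected boundary."""
    cuts = find_chunk_boundaries(ppls, threshold)
    chunks, start = [], 0
    for b in cuts:
        chunks.append(sentences[start:b + 1])
        start = b + 1
    chunks.append(sentences[start:])
    return chunks


# Toy example: perplexity dips at sentences 1 and 4 create two boundaries.
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
ppls = [5.0, 2.0, 6.0, 4.0, 1.0, 3.0]
print(chunk_sentences(sentences, ppls))
# → [['s0', 's1'], ['s2', 's3', 's4'], ['s5']]
```

A dynamic-merging step, as proposed in the paper, could then greedily merge adjacent chunks until a target length is reached, trading fine granularity for coherence on complex texts.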