Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that vary in size, so that the semantic independence of their content is better captured. We propose LumberChunker, a method that leverages an LLM to dynamically segment documents by iteratively prompting it to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark of 3000 "needle in a haystack" question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) and that, when integrated into a RAG pipeline, it is more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our code and data are available at https://github.com/joaodsmarques/LumberChunker
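The iterative segmentation loop described above can be illustrated with a minimal sketch. This is not the released implementation: the function `segment`, the parameter `group_size`, and the callable `find_shift` are hypothetical names introduced here, and `find_shift` stands in for the LLM prompt that returns where the content begins to shift. The toy heuristic `toy_find_shift` only compares a topic letter at the start of each passage so that the example runs without any model.

```python
from typing import Callable, List


def segment(
    paragraphs: List[str],
    find_shift: Callable[[List[str]], int],
    group_size: int = 4,
) -> List[List[str]]:
    """Split paragraphs into variable-size chunks.

    find_shift receives a window of consecutive paragraphs and returns the
    index within that window of the first paragraph where the content shifts.
    In the real method this decision comes from an LLM (assumption: the
    prompt and window sizing differ from this sketch).
    """
    chunks: List[List[str]] = []
    start = 0
    while start < len(paragraphs):
        window = paragraphs[start:start + group_size]
        if len(window) == 1:
            # A single trailing paragraph needs no model call.
            chunks.append(window)
            break
        shift = find_shift(window)
        # Clamp so that every iteration consumes at least one paragraph.
        shift = max(1, min(shift, len(window)))
        chunks.append(paragraphs[start:start + shift])
        # The next window starts at the detected shift point.
        start += shift
    return chunks


def toy_find_shift(window: List[str]) -> int:
    """Hypothetical stand-in for the LLM: the first letter marks the topic."""
    for i, passage in enumerate(window[1:], start=1):
        if passage[0] != window[0][0]:
            return i
    return len(window)  # no shift found: keep the whole window together


paragraphs = ["a1", "a2", "b1", "b2", "b3", "c1"]
chunks = segment(paragraphs, toy_find_shift, group_size=4)
print(chunks)
```

Because each new window begins at the previously detected shift point, chunk boundaries follow the content rather than a fixed size, which is the property the abstract argues for.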