LumberChunker: Long-Form Narrative Document Segmentation

Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker

翻译：现代自然语言处理任务日益依赖密集检索方法来获取最新且相关的上下文信息。我们的研究动机基于这样一个前提：检索受益于大小可变的文本片段，从而能更好地捕捉内容的语义独立性。我们提出了LumberChunker，一种利用大语言模型动态分割文档的方法，该方法通过迭代提示大语言模型来识别连续段落组中内容开始发生转变的位置。为评估我们的方法，我们引入了GutenQA基准测试集，该数据集包含3000个“大海捞针”式问答对，源自古登堡计划中100本公共领域叙事书籍。实验表明，LumberChunker不仅在检索性能（DCG@20）上超越最具竞争力的基线方法7.37%，而且当集成到检索增强生成流程中时，LumberChunker被证明比其他分块方法及竞争基线（如Gemini 1.5M Pro）更为有效。我们的代码与数据公开于 https://github.com/joaodsmarques/LumberChunker

相关内容

大语言模型

关注 67

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日