Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), segmentation quality determines how well a system can locate and return relevant information. Traditional methods, such as fixed-length or coherence-based segmentation, ignore user intent and therefore produce chunks that split answers or contain irrelevant noise. We introduce Intent-Driven Dynamic Chunking (IDC), an approach that uses predicted user queries to guide document segmentation. IDC uses a large language model (LLM) to generate likely user intents for a document and then applies a dynamic programming (DP) algorithm to find the globally optimal chunk boundaries. Applying DP to intent-aware segmentation in this way is novel and avoids the pitfalls of greedy boundary selection. We evaluated IDC on six diverse question-answering datasets, including news articles, Wikipedia, academic papers, and technical documentation. IDC outperformed traditional chunking strategies on five datasets, improving top-1 retrieval accuracy by 5% to 67%, and matched the best baseline on the sixth. IDC also produced 40-60% fewer chunks than baseline methods while achieving 93-100% answer coverage. These results demonstrate that aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.
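The abstract names the two stages of IDC (intent generation, then dynamic programming over chunk boundaries) but does not state the objective being optimized. The following is a minimal sketch of the second stage under stated assumptions, not the authors' implementation: the intents are hard-coded rather than produced by an LLM, and `overlap_score`, `chunk_penalty`, and `max_len` are hypothetical stand-ins for whatever relevance measure and constraints the actual method uses.

```python
from typing import Callable, List, Sequence, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def overlap_score(chunk: str, intent: str) -> float:
    """Hypothetical relevance: fraction of intent tokens present in the chunk.

    A real system would more likely use embedding similarity; this is an
    assumption made only to keep the sketch self-contained.
    """
    intent_tokens = tokenize(intent)
    if not intent_tokens:
        return 0.0
    return len(intent_tokens & tokenize(chunk)) / len(intent_tokens)


def chunk_by_intent(
    sentences: Sequence[str],
    intents: Sequence[str],
    max_len: int = 4,            # assumed cap on sentences per chunk
    chunk_penalty: float = 0.5,  # assumed cost per chunk, discourages over-splitting
    score: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[int, int]]:
    """Return chunk boundaries as half-open (start, end) sentence index pairs.

    best[end] holds the highest total score of any segmentation of
    sentences[:end]; back[end] records where the last chunk starts.
    """
    n = len(sentences)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            text = " ".join(sentences[start:end])
            # A chunk is worth as much as the intent it answers best.
            value = max((score(text, q) for q in intents), default=0.0)
            candidate = best[start] + value - chunk_penalty
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Walk the back-pointers from the end to recover the boundaries.
    bounds: List[Tuple[int, int]] = []
    end = n
    while end > 0:
        bounds.append((back[end], end))
        end = back[end]
    bounds.reverse()
    return bounds


sentences = [
    "The warranty lasts two years.",
    "It covers battery defects.",
    "To reset the device, hold the power button.",
    "Release the button after ten seconds.",
]
intents = ["warranty covers battery", "reset device button"]

print(chunk_by_intent(sentences, intents))  # [(0, 2), (2, 4)]
```

Because every candidate boundary is scored against the best segmentation of the preceding prefix, the result is optimal for the chosen objective, whereas a greedy pass would commit to each cut without seeing what follows. In this toy input the two warranty sentences form one chunk and the two reset sentences form another, so each assumed intent is answered by a single chunk.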