DAPR: A Benchmark on Document-Aware Passage Retrieval

Recent neural retrieval mainly focuses on ranking short texts and struggles with long documents. Existing work mainly evaluates the ranking of either passages or whole documents. However, there are many cases where users want to find a relevant passage within a long document from a huge corpus, e.g., legal cases and research papers. In this scenario, the passage by itself often provides little document context, which makes it hard for current approaches to find the correct document and return accurate results. To fill this gap, we propose and name this task Document-Aware Passage Retrieval (DAPR) and build a benchmark including multiple datasets from various domains, covering both DAPR and whole-document retrieval. In experiments, we extend state-of-the-art neural passage retrievers with document-level context via different approaches, including prepending a document summary, pooling over passage representations, and hybrid retrieval with BM25. The hybrid-retrieval systems, which perform best overall, improve only marginally on the DAPR tasks while improving significantly on the document-retrieval tasks. This motivates further research in developing better retrieval systems for the new task. The code and the data are available at https://github.com/kwang2049/dapr
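
To make the hybrid-retrieval idea concrete, the following is a minimal sketch of one way a passage-level neural score could be combined with a document-level BM25 score. It is not the benchmark's implementation: the min-max normalization, the convex combination, the weight `alpha`, the function names, and the toy scores are all assumptions chosen for illustration.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_passage_scores(
    passage_scores: Dict[str, float],  # neural retriever score per passage (assumed given)
    doc_scores: Dict[str, float],      # BM25 score per document (assumed given)
    passage_to_doc: Dict[str, str],    # maps each passage ID to its document ID
    alpha: float = 0.5,                # hypothetical weight on the passage score
) -> Dict[str, float]:
    """Fuse each passage's score with the score of the document it belongs to."""
    p_norm = min_max(passage_scores)
    d_norm = min_max(doc_scores)
    fused = {}
    for pid, p_score in p_norm.items():
        # Documents that BM25 did not retrieve contribute 0.0.
        d_score = d_norm.get(passage_to_doc[pid], 0.0)
        fused[pid] = alpha * p_score + (1 - alpha) * d_score
    return fused


# Toy scores (made up): "d1#0" wins on the passage score alone, but its
# document "d1" matches the query poorly at the document level.
passage_scores = {"d1#0": 0.82, "d1#1": 0.40, "d2#0": 0.80}
doc_scores = {"d1": 3.0, "d2": 12.0}
passage_to_doc = {"d1#0": "d1", "d1#1": "d1", "d2#0": "d2"}

fused = hybrid_passage_scores(passage_scores, doc_scores, passage_to_doc)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

In this toy case the document-level signal moves the passage from document `d2` above the passage that ranked first on the neural score alone, which is the kind of document awareness the task asks for.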

