Neural retrieval research has so far focused on ranking short texts and struggles with long documents. In many cases, users want to find a relevant passage within a long document from a huge corpus, e.g., Wikipedia articles or research papers. We propose and name this task \emph{Document-Aware Passage Retrieval} (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that a major share of the errors (53.5\%) is due to missing document context. This motivates us to build a benchmark for the task comprising multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs the strongest on the mixture of easy and hard queries, it completely fails on the hard queries that require document-context understanding. Contextualized passage representations (e.g., prepending document titles), on the other hand, achieve good improvement on these hard queries, but their overall performance is also rather poor. Our benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.
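The two ways of adding document context named above can be illustrated with a minimal sketch. It is not the implementation from the released code: the `Passage` type, the helper names, the min-max normalization, and the convex-combination fusion with a `weight` parameter are all assumptions chosen for illustration, and the BM25 and dense scores are assumed to come from existing retrievers.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Passage:
    # Hypothetical container; field names are illustrative only.
    passage_id: str
    doc_title: str
    text: str


def contextualize(passage: Passage, separator: str = " ") -> str:
    """Prepend the document title so the encoder input carries document context."""
    title = passage.doc_title.strip()
    if not title:
        return passage.text
    return f"{title}{separator}{passage.text}"


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and dense scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so no candidate is preferred.
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float], dense: Dict[str, float], weight: float = 0.5
) -> Dict[str, float]:
    """Fuse scores by a convex combination (an assumed fusion rule).

    `weight` is the share given to BM25; passages missing from one
    retriever's candidate list contribute 0.0 for that retriever.
    """
    bm25_n = min_max_normalize(bm25)
    dense_n = min_max_normalize(dense)
    ids = set(bm25_n) | set(dense_n)
    return {
        pid: weight * bm25_n.get(pid, 0.0) + (1.0 - weight) * dense_n.get(pid, 0.0)
        for pid in ids
    }
```

In this sketch, `contextualize` would be applied to each passage before encoding, while `hybrid_scores` would be applied per query to the candidate scores of the two retrievers; other fusion rules (e.g., rank-based ones) are equally plausible.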