Neural retrieval research has so far focused on ranking short texts and struggles with long documents. In many cases, users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles or research papers. We propose this task and name it \emph{Document-Aware Passage Retrieval} (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5\%) are due to missing document context. This motivates us to build a benchmark for the task, comprising multiple datasets from heterogeneous domains. In our experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs best on the mixture of easy and hard queries, it completely fails on the hard queries that require document-context understanding. Contextualized passage representations (e.g. prepending document titles), on the other hand, achieve good improvements on these hard queries, but their overall performance is rather poor. Our benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.
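The two ways of adding document context named in the abstract can be illustrated with a minimal sketch. It is not the implementation from the released code: the function names (`prepend_title`, `min_max`, `hybrid_scores`), the min-max normalization, the convex-combination fusion and the default `weight` of 0.5 are all assumptions chosen for illustration, and the BM25 and neural scores are assumed to be precomputed per passage ID.

```python
from typing import Dict


def prepend_title(title: str, passage: str, sep: str = " ") -> str:
    """Contextualize a passage by prepending its document title (hypothetical helper)."""
    return f"{title}{sep}{passage}" if title else passage


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float],
    neural: Dict[str, float],
    weight: float = 0.5,  # assumed interpolation weight, not taken from the paper
) -> Dict[str, float]:
    """Fuse normalized BM25 and neural scores with a convex combination.

    Passages missing from one ranking contribute 0 for that component.
    """
    b, n = min_max(bm25), min_max(neural)
    return {
        pid: weight * b.get(pid, 0.0) + (1.0 - weight) * n.get(pid, 0.0)
        for pid in set(b) | set(n)
    }


if __name__ == "__main__":
    text = prepend_title("Alan Turing", "He was born in London.")
    print(text)
    fused = hybrid_scores({"p1": 2.0, "p2": 6.0}, {"p1": 0.9, "p2": 0.1, "p3": 0.5})
    print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))
```

In this sketch, title prepending changes what the neural retriever encodes, whereas hybrid fusion only combines the final scores, so the two can be applied independently or together.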