Neural retrieval work has so far focused on ranking short texts and struggles with long documents. In many cases, users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles, research papers, etc. We propose this task and name it \emph{Document-Aware Passage Retrieval} (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the major errors (53.5\%) are due to missing document context. This motivates us to build a benchmark for the task that includes multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs strongest on the mixture of easy and hard queries, it completely fails on the hard queries that require document-context understanding. Contextualized passage representations (e.g. prepending document titles), on the other hand, achieve good improvement on these hard queries, but their overall performance is also rather poor. Our benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.
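
To make the two ways of adding document context more concrete, the following is a minimal sketch and not the implementation from the repository. It assumes that BM25 and neural scores are already available per passage ID, and it fuses them with a weighted sum of min-max normalized scores; the abstract does not specify the fusion method, so this choice, along with the names `prepend_title`, `min_max_normalize`, `hybrid_scores`, and the `weight` parameter, is hypothetical.

```python
from typing import Dict


def prepend_title(title: str, passage: str, sep: str = " ") -> str:
    """Contextualize a passage by putting its document title in front of it."""
    return f"{title}{sep}{passage}" if title else passage


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float],
    neural: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Fuse two score maps keyed by passage ID (assumed fusion rule).

    Each score map is min-max normalized, then combined as
    weight * bm25 + (1 - weight) * neural. A passage missing from one
    map contributes 0.0 for that retriever.
    """
    bm25_n = min_max_normalize(bm25)
    neural_n = min_max_normalize(neural)
    pids = set(bm25_n) | set(neural_n)
    return {
        pid: weight * bm25_n.get(pid, 0.0) + (1.0 - weight) * neural_n.get(pid, 0.0)
        for pid in pids
    }


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    text = prepend_title("Albert Einstein", "He was born in Ulm.")
    fused = hybrid_scores({"p1": 2.0, "p2": 4.0}, {"p1": 3.0, "p2": 1.0, "p3": 2.0})
    ranking = sorted(fused, key=fused.get, reverse=True)
    print(text)
    print(ranking)
```

In this sketch, the contextualized text returned by `prepend_title` would be what the neural retriever encodes, while `hybrid_scores` operates purely on the scores afterwards, so the two approaches can be tried separately or together.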