Dense retrieval has become a prominent method for obtaining relevant context or world knowledge in open-domain NLP tasks. When a learned dense retriever is applied to a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We find that the choice of retrieval unit significantly affects the performance of both retrieval and downstream tasks. In contrast to the typical approach of using passages or sentences, we introduce a novel retrieval unit for dense retrieval: the proposition. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularities. Our results show that proposition-based retrieval significantly outperforms traditional passage-based and sentence-based methods in dense retrieval. Moreover, retrieval by proposition also improves the performance of downstream QA tasks: because the retrieved texts are more densely packed with question-relevant information, they reduce the need for long inputs and minimize the inclusion of extraneous, irrelevant information.
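To make the notion of a retrieval unit concrete, the following is a minimal sketch of indexing the same text at three granularities and querying each index. It is an illustration under stated assumptions, not the method evaluated here: the `embed` function is a toy bag-of-words stand-in for a learned dense encoder, the sentence splitter is naive, the example passage is invented, and the propositions are written by hand rather than produced by any model. All function and variable names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    # Toy stand-in for a dense encoder: a bag-of-words count vector.
    # A real system would use a learned dense retriever here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(units):
    # An index is a list of (retrieval unit, vector) pairs.
    return [(unit, embed(unit)) for unit in units]


def retrieve(index, query, k=1):
    # Return the k units most similar to the query.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [unit for unit, _ in ranked[:k]]


# Invented example passage (assumption, not taken from the source).
PASSAGE = (
    "The Eiffel Tower is in Paris. It was completed in 1889. "
    "Its height is about 330 metres."
)

# Hand-written propositions: each is atomic and self-contained,
# with pronouns resolved to the entity they refer to.
PROPOSITIONS = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower was completed in 1889.",
    "The Eiffel Tower is about 330 metres tall.",
]

indexes = {
    "passage": build_index([PASSAGE]),
    "sentence": build_index(split_sentences(PASSAGE)),
    "proposition": build_index(PROPOSITIONS),
}

query = "When was the Eiffel Tower completed?"
for name, index in indexes.items():
    print(name, "->", retrieve(index, query, k=1))
```

In this toy setup the passage index can only return the whole passage, including the facts unrelated to the question. The sentence index ranks "The Eiffel Tower is in Paris." first, because the sentence that holds the answer refers to the tower only as "It". The proposition index returns the single self-contained fact that answers the question. This is a property of the toy scorer and this one example, and says nothing about the size of the effect with a real dense retriever.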