Dense retrieval has become a prominent method for obtaining relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence. We discover that the choice of retrieval unit significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, the proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularities. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units on retrieval tasks. Moreover, constructing prompts from fine-grained retrieved units for retrieval-augmented language models improves performance on downstream QA tasks under a fixed computation budget.
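To make the granularity contrast concrete, here is a minimal illustrative sketch (not the authors' implementation): the same passage represented as passage-, sentence-, and proposition-level retrieval units. The proposition strings are hand-written decompositions, and a toy word-overlap scorer stands in for a learned dense retriever's similarity function; all names here are hypothetical.

```python
# One example passage, indexed at three granularities.
passage = (
    "The Leaning Tower of Pisa is a freestanding bell tower in Pisa, Italy. "
    "It is known for its nearly four-degree lean."
)

# Passage-level: the whole passage is a single retrieval unit.
passage_units = [passage]

# Sentence-level: one unit per sentence.
sentence_units = [s.strip() + "." for s in passage.rstrip(".").split(". ")]

# Proposition-level: atomic, self-contained factoids
# (decomposed by hand here; the paper uses a learned model for this step).
proposition_units = [
    "The Leaning Tower of Pisa is a freestanding bell tower.",
    "The Leaning Tower of Pisa is located in Pisa, Italy.",
    "The Leaning Tower of Pisa leans at nearly four degrees.",
]

def score(query: str, unit: str) -> int:
    """Toy lexical-overlap score standing in for dense similarity."""
    return len(set(query.lower().split()) & set(unit.lower().split()))

# Retrieving over propositions surfaces exactly the factoid the query needs,
# rather than a passage that mixes it with unrelated facts.
query = "what angle does the tower of pisa lean at"
best = max(proposition_units, key=lambda u: score(query, u))
```

The point of the sketch is only that finer-grained units isolate individual factoids, so the top-ranked unit carries less irrelevant text into a downstream prompt at the same token budget.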