Revela: Dense Retriever Learning via Language Modeling

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on both CoIR and BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
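The weighting step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the temperature parameter, and the exclusion of each chunk from its own cross-document context are all assumptions made for illustration. The sketch computes retriever similarities within a batch, turns them into weights over the other chunks, and uses those weights to mix hypothetical per-pair cross-attention outputs.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a non-zero vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def in_batch_weights(embeddings, temperature=1.0):
    """Row i holds weights over the OTHER chunks in the batch.

    Assumptions (not from the source): cosine similarity, a temperature,
    and a zero weight for each chunk on itself.
    """
    unit = [normalize(e) for e in embeddings]
    n = len(unit)
    weights = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        scores = [dot(unit[i], unit[j]) / temperature for j in others]
        row = [0.0] * n
        for j, p in zip(others, softmax(scores)):
            row[j] = p
        weights.append(row)
    return weights


def mix_cross_document(weights, cross_outputs):
    """Combine hypothetical cross-attention outputs with retriever weights.

    cross_outputs[i][j] stands in for the vector that chunk i reads from
    chunk j; the result for chunk i is the similarity-weighted sum over j.
    """
    n = len(weights)
    dim = len(cross_outputs[0][0])
    mixed = []
    for i in range(n):
        vec = [0.0] * dim
        for j in range(n):
            for d in range(dim):
                vec[d] += weights[i][j] * cross_outputs[i][j][d]
        mixed.append(vec)
    return mixed
```

In an automatic-differentiation framework, the next-token prediction loss computed on top of the mixed context would propagate gradients back through the weights into the retriever's embeddings, which is how the retriever can be optimized as part of language modeling.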
