Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
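To make the core mechanism concrete, below is a minimal sketch (not the authors' implementation) of the idea described above: next-token prediction is conditioned on other documents in the same batch, with the cross-document mixing weighted by retriever-computed similarity scores so that the language-modeling loss also trains the retriever. The toy model sizes, the mean-pooled retriever, and the simple "mix hidden states" form of cross-document conditioning are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of similarity-weighted in-batch conditioning for next-token
# prediction. All module choices here (GRU LM, mean-pooled retriever, additive
# mixing of hidden states) are assumptions for illustration only.
import torch
import torch.nn.functional as F
from torch import nn


class ToyRetriever(nn.Module):
    """Embeds each document into a single normalized vector via mean pooling."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                                   # (B, T)
        return F.normalize(self.emb(token_ids).mean(dim=1), dim=-1)  # (B, D)


class ToyLM(nn.Module):
    """A tiny causal LM whose hidden states can be mixed across documents."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids, cross_weights=None):
        h, _ = self.rnn(self.emb(token_ids))        # (B, T, D) local context
        if cross_weights is not None:
            # Mix in other documents' hidden states, weighted by retriever
            # similarity; this is the "in-batch attention" in spirit.
            cross = torch.einsum("bk,ktd->btd", cross_weights, h)
            h = h + cross
        return self.head(h)                         # (B, T, V)


def revela_style_loss(lm, retriever, token_ids):
    """Next-token loss conditioned on similarity-weighted in-batch documents."""
    doc_vecs = retriever(token_ids)                                  # (B, D)
    sims = doc_vecs @ doc_vecs.T                                     # (B, B)
    self_mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    weights = sims.masked_fill(self_mask, float("-inf")).softmax(dim=-1)
    logits = lm(token_ids[:, :-1], cross_weights=weights)            # predict next tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), token_ids[:, 1:].reshape(-1)
    )


if __name__ == "__main__":
    torch.manual_seed(0)
    batch = torch.randint(0, 1000, (8, 32))          # 8 in-batch documents, 32 tokens each
    lm, retriever = ToyLM(), ToyRetriever()
    opt = torch.optim.Adam(list(lm.parameters()) + list(retriever.parameters()), lr=1e-3)
    loss = revela_style_loss(lm, retriever, batch)
    loss.backward()                                  # LM loss gradients flow into the retriever
    opt.step()
    print(f"loss: {loss.item():.3f}")
```

Because the cross-document weights are produced by the retriever and sit on the gradient path of the next-token loss, minimizing that loss pushes the retriever to assign high similarity to documents that genuinely help predict each other, without any annotated query-document pairs.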