Dense document embeddings are central to neural retrieval. The dominant paradigm trains and constructs embeddings by running an encoder directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted retrieval use cases, and that a contextualized document embedding should take into account both the document and neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document's neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor-document information into the encoded representation. Both methods outperform biencoders in several settings, with the differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
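To make the first method concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE) loss combined with neighbor-aware batching, so that the in-batch negatives for each query are documents from its own neighborhood. This is an illustrative NumPy implementation under our own assumptions, not the paper's code; the function names and the centroid-based grouping heuristic are hypothetical.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss.

    q: (B, D) query embeddings; d: (B, D) document embeddings.
    Row i of d is the positive for row i of q; all other rows in the
    batch serve as negatives.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature                  # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # NLL of the positives

def contextual_batches(doc_embs, batch_size, seed=0):
    """Group documents with their neighbors so in-batch negatives are
    'in context'. Here: assign each document to its most similar centroid
    among randomly chosen seed documents (a crude stand-in for clustering).
    Returns a list of index arrays that partition the dataset.
    """
    rng = np.random.default_rng(seed)
    n_groups = max(1, len(doc_embs) // batch_size)
    centroids = doc_embs[rng.choice(len(doc_embs), size=n_groups, replace=False)]
    assign = np.argmax(doc_embs @ centroids.T, axis=1)
    return [np.where(assign == c)[0] for c in range(n_groups)]
```

A training step would draw one group at a time, embed its queries and documents, and apply `info_nce_loss` within the group; the key design choice is that negatives come from the document's neighborhood rather than a uniformly random batch.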