Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
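The second method above (a contextual architecture that folds neighbor information into the document representation) can be sketched as a two-stage encoder: a first stage embeds each document independently, and a second stage re-encodes the target document conditioned on pooled embeddings of its corpus neighbors. This is a minimal, purely illustrative sketch, not the paper's implementation: the hashing tokenizer, mean pooling, random projection matrices, and the names `first_stage` / `contextual_embed` are all assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 64

# Shared token embedding table (randomly initialized; a trained model
# would learn these weights via the contrastive objective).
W_tok = rng.normal(0.0, 0.1, size=(VOCAB, DIM))

# Hypothetical second-stage projection that mixes the document vector
# with its neighbor-context vector.
W_ctx = rng.normal(0.0, 0.1, size=(2 * DIM, DIM))

def tokenize(text):
    # Toy hashing tokenizer, illustrative only.
    return [hash(w) % VOCAB for w in text.lower().split()]

def first_stage(doc):
    """Stage 1: embed one document independently (the standard biencoder step)."""
    ids = tokenize(doc)
    return W_tok[ids].mean(axis=0)

def contextual_embed(doc, neighbors):
    """Stage 2: condition the document embedding on its corpus neighbors."""
    doc_vec = first_stage(doc)
    ctx_vec = np.mean([first_stage(n) for n in neighbors], axis=0)
    out = np.concatenate([doc_vec, ctx_vec]) @ W_ctx
    return out / np.linalg.norm(out)  # L2-normalize for cosine retrieval
```

Because the neighbor vectors enter the second stage, the same document maps to different embeddings under different corpus contexts, which is exactly the property that distinguishes a contextual embedding from a standard biencoder output.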