Recent advancements in large language models (LLMs) have enabled augmenting information retrieval (IR) pipelines with synthetic data in various ways. Yet the main training paradigm remains contrastive learning with binary relevance labels and the InfoNCE loss, in which one positive document is contrasted against one or more negatives. This objective places all documents not explicitly annotated as relevant on the same negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries at several distinct levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift: it clearly outperforms that retriever when evaluated zero-shot on the BEIR dataset collection.
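To make the contrast concrete, the sketch below compares the two objectives on a toy query with four synthetic documents at graded relevance levels. The function names, the specific similarity scores, and the construction of the target distribution by normalizing the relevance grades are illustrative assumptions, not the paper's exact formulation; the list-wise loss shown is the standard 1-D Wasserstein distance computed via cumulative distribution functions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def info_nce(scores, pos_idx=0):
    # Binary contrastive loss: one positive vs. all other documents,
    # which are treated as equally negative regardless of their grade.
    return -np.log(softmax(scores)[pos_idx])

def wasserstein_1d(p, q):
    # 1-D Wasserstein distance between two distributions over ordered
    # relevance levels: the sum of absolute differences of their CDFs.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

# Hypothetical query-document similarity scores for four synthetic
# documents generated at relevance grades 3 (highly relevant) down to 0.
scores = np.array([2.0, 1.2, 0.3, -0.5])
grades = np.array([3.0, 2.0, 1.0, 0.0])

p = softmax(scores)        # model's ranking distribution over the documents
q = grades / grades.sum()  # target distribution induced by the graded levels

loss_listwise = wasserstein_1d(p, q)  # penalizes deviation from the full ranking
loss_binary = info_nce(scores)        # only rewards the single "positive"
```

Note that `loss_binary` is unchanged if documents 1-3 swap scores among themselves, whereas `loss_listwise` is sensitive to the full graded ordering, which is the nuance the abstract argues InfoNCE misses.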