Recently, embedding-based retrieval, also known as dense retrieval, has shown state-of-the-art results compared with traditional sparse or bag-of-words-based approaches. This paper introduces a model-agnostic document-level embedding framework based on large language model (LLM) augmentation. It also improves several important components of the retrieval model training process, such as negative sampling and the loss function. By implementing this LLM-augmented retrieval framework, we significantly improve the effectiveness of widely used retriever models, including bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), thereby achieving state-of-the-art results on the LoTTE and BEIR datasets.