Representation-based retrieval models, commonly referred to as biencoders, estimate the relevance of a document to a query by computing the similarity of their respective embeddings. Current state-of-the-art biencoders are trained with an expensive regime involving knowledge distillation from a teacher model and batch sampling. Instead of relying on a teacher model, we contribute a novel parameter-free loss function for self-supervision that exploits the pre-trained language modeling capabilities of the encoder model as a training signal, eliminating the need for batch sampling by performing implicit hard negative mining. We investigate the capabilities of our proposed approach through extensive ablation studies, demonstrating that self-distillation can match the effectiveness of teacher distillation using only 13.5% of the data, while offering a speedup in training time between 3x and 15x compared to parametrized losses. Code and data are made openly available.
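The biencoder scoring scheme described above can be sketched minimally: each text is mapped independently to an embedding, and relevance is the similarity of the two vectors. The toy encoder below (a bag-of-letters count projected by a fixed random matrix) is purely illustrative and stands in for a pre-trained transformer; the names `encode` and `relevance` are hypothetical, not from the paper.

```python
import numpy as np

# Stand-in for a pre-trained encoder: a fixed random projection.
# In a real biencoder this would be a transformer producing a
# pooled embedding for the input text.
rng = np.random.default_rng(0)
PROJ = rng.standard_normal((26, 8))

def encode(text: str) -> np.ndarray:
    # Toy embedding: bag-of-letters counts projected to 8 dims,
    # then L2-normalised so dot product equals cosine similarity.
    counts = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            counts[ord(ch) - ord("a")] += 1
    vec = counts @ PROJ
    return vec / (np.linalg.norm(vec) + 1e-9)

def relevance(query: str, doc: str) -> float:
    # Biencoder relevance estimate: similarity of the two
    # independently computed embeddings.
    return float(encode(query) @ encode(doc))
```

Because queries and documents are encoded independently, document embeddings can be precomputed and indexed, which is what makes this family of models attractive for first-stage retrieval.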