In monolingual dense retrieval, many studies focus on distilling knowledge from a cross-encoder re-ranker into a dual-encoder retriever, and these methods achieve better performance because the cross-encoder re-ranker is effective. However, we find that the performance of the cross-encoder re-ranker depends heavily on the number of training samples and the quality of negative samples, both of which are hard to obtain in the cross-lingual setting. In this paper, we propose using a query generator as the teacher in the cross-lingual setting, since it depends less on abundant training samples and high-quality negative samples. In addition to traditional knowledge distillation, we further propose a novel enhancement method that uses the query generator to help the dual-encoder align queries from different languages without requiring any additional parallel sentences. Experimental results show that our method outperforms state-of-the-art methods on two benchmark datasets.
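As a rough illustration of distilling from a query generator into a dual-encoder, the following minimal sketch computes a KL-divergence loss between a teacher distribution and a student distribution over candidate passages. The abstract does not state the exact objective, so everything here is an assumption: the teacher score is taken to be the generator's log-likelihood of the query given each passage, the student score is a dot product of query and passage embeddings, and all function and parameter names (`distillation_loss`, `teacher_scores`, `temperature`) are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distillation_loss(teacher_scores, query_vec, passage_vecs, temperature=1.0):
    """Hypothetical distillation loss for one query.

    teacher_scores: assumed to be log p(query | passage) from a query
        generator, one value per candidate passage.
    query_vec, passage_vecs: dual-encoder embeddings (assumed given).
    """
    # Student relevance scores: dot product between query and each passage.
    student_scores = [dot(query_vec, p) for p in passage_vecs]
    teacher_dist = softmax(teacher_scores, temperature)
    student_dist = softmax(student_scores, temperature)
    # The student is pushed toward the teacher's ranking over the candidates.
    return kl_divergence(teacher_dist, student_dist)
```

In this sketch the loss is zero when the student's scores induce the same distribution as the teacher's and grows as the two rankings diverge. In practice the same quantity would be computed on batched tensors in a deep learning framework so that gradients can flow into the dual-encoder.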