Prior work on English monolingual retrieval has shown that a cross-encoder trained on a large number of relevance judgments for query-document pairs can serve as a teacher for training more efficient, yet similarly effective, dual-encoder student models. Applying similar knowledge distillation to train an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging because no sufficiently large training collection exists for such language pairs. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space lets the teacher model perform inference in an optimized setting while the student model is trained directly for CLIR. Trained models and artifacts are publicly available on Hugging Face.
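To make the teacher-student setup concrete, the following is a minimal sketch of one common distillation objective: the KL divergence between the teacher's and the student's score distributions over a set of candidate passages for a single query. The choice of KL divergence, the function names, and the toy vectors and scores are assumptions for illustration; the abstract does not specify the loss or the encoders used.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dual-encoder relevance score: dot product of two embeddings."""
    return sum(a * b for a, b in zip(u, v))

def distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate passages.

    Assumption: KL divergence is used here as a typical distillation
    loss; it is not confirmed by the abstract.
    """
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy data. In practice the teacher scores would come from a
# cross-encoder scoring (query, passage) text pairs, and the student
# embeddings from a dual encoder given the query and the passages in the
# document language.
query_vec = [0.2, 0.9]
passage_vecs = [[0.1, 1.0], [0.8, 0.1], [0.3, 0.3]]
teacher_scores = [4.0, -1.0, 0.5]

student_scores = [dot(query_vec, p) for p in passage_vecs]
loss = distillation_loss(teacher_scores, student_scores)
print(f"distillation loss: {loss:.4f}")
```

In this sketch the teacher and the student need not see the same text: the teacher could score the original English pairs while the student encodes translated passages, which is one reading of the design space described above.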