Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework, which trains a cross-language neural dual-encoder model using translation and distillation. However, Translate-Distill supports only a single document language. Multilingual information retrieval (MLIR), which ranks a multilingual document collection, is harder to train than CLIR because the model must assign comparable relevance scores to documents in different languages. This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train, the previous state-of-the-art training approach, by 5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is robust to the way languages are mixed in training batches. Our implementation is available on GitHub.