In this work, we explore a Multilingual Information Retrieval (MLIR) task, where the collection includes documents in multiple languages. We demonstrate that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to suboptimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections: some languages are better represented in the collection, and some benefit from large-scale training data. To address this issue, we present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representations of documents in different languages into the same embedding space. To address the challenges of data scarcity and imbalance, we introduce a knowledge distillation strategy. The teacher model is trained on rich English retrieval data, and by leveraging bi-text data, our distillation framework transfers its retrieval knowledge to the multilingual document encoder. As a result, our approach does not require any multilingual retrieval training data. Extensive experiments on three MLIR datasets covering a total of 15 languages demonstrate that KD-SPD significantly outperforms competitive baselines in all cases. Further analyses show that our method has less language bias and better zero-shot transfer ability to new languages.
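The distillation idea described above can be illustrated with a toy sketch. This is not the KD-SPD implementation: the abstract does not state the loss or the architecture, so the mean-squared-error objective, the single linear "student" encoder, and all names here (`student_encode`, `distill_step`, `run_demo`) are assumptions made purely for illustration. The sketch shows only the general pattern: a frozen teacher embeds the English side of each bi-text pair, and the student is trained so that its embedding of the non-English side moves toward that target, with no retrieval labels in the target language.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_encode(weights, features):
    """Hypothetical student encoder: one linear map (assumption, not KD-SPD)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def distill_step(weights, pairs, lr=0.1):
    """One gradient-descent step toward the frozen teacher embeddings.

    pairs: list of (teacher_embedding_of_english_side, features_of_foreign_side).
    Updates `weights` in place and returns the mean loss before the update.
    """
    dim_out, dim_in = len(weights), len(weights[0])
    grads = [[0.0] * dim_in for _ in range(dim_out)]
    total = 0.0
    for target, feats in pairs:
        out = student_encode(weights, feats)
        total += mse(out, target)
        for i in range(dim_out):
            # Gradient of the per-pair MSE with respect to weights[i][j].
            coeff = 2.0 * (out[i] - target[i]) / dim_out
            for j in range(dim_in):
                grads[i][j] += coeff * feats[j]
    n = len(pairs)
    for i in range(dim_out):
        for j in range(dim_in):
            weights[i][j] -= lr * grads[i][j] / n
    return total / n


def run_demo(steps=200, seed=0):
    """Train the toy student on synthetic 'bi-text' pairs; return the loss history."""
    rng = random.Random(seed)
    dim_in, dim_out, n_pairs = 4, 3, 8
    # Synthetic stand-in for the teacher: targets come from a hidden linear map.
    hidden = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    pairs = []
    for _ in range(n_pairs):
        feats = [rng.uniform(-1, 1) for _ in range(dim_in)]
        pairs.append((student_encode(hidden, feats), feats))
    weights = [[0.0] * dim_in for _ in range(dim_out)]
    return [distill_step(weights, pairs) for _ in range(steps)]


if __name__ == "__main__":
    history = run_demo()
    print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.6f}")
```

In a realistic setting the teacher would be a trained English retrieval encoder and the student a multilingual document encoder; the toy linear map only stands in for both so that the training signal is easy to follow.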