Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate using bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust to the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers for cross-lingual and multilingual retrieval.
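To make the idea of lexicon-based code-switching concrete, the following is a minimal sketch, not the procedure used in this work. It assumes whitespace tokenization, a lowercase lookup, and a uniformly random choice among candidate translations; the function name `code_switch`, the `ratio` and `seed` parameters, and the toy English-German lexicon are all hypothetical and chosen only for illustration.

```python
import random


def code_switch(text, lexicon, ratio=0.5, seed=None):
    """Swap tokens for lexicon translations with probability `ratio`.

    Assumptions (not taken from the paper): tokens are split on
    whitespace, looked up in lowercase, and a translation is picked
    uniformly at random when several candidates exist.
    """
    rng = random.Random(seed)
    switched = []
    for token in text.split():
        translations = lexicon.get(token.lower())
        # Only tokens with a lexicon entry are eligible for switching.
        if translations and rng.random() < ratio:
            switched.append(rng.choice(translations))
        else:
            switched.append(token)
    return " ".join(switched)


# Hypothetical toy English -> German lexicon, for illustration only.
toy_lexicon = {
    "what": ["was"],
    "is": ["ist"],
    "the": ["die", "der"],
    "capital": ["Hauptstadt"],
    "of": ["von"],
}

query = "what is the capital of France"
print(code_switch(query, toy_lexicon, ratio=0.5, seed=0))
```

In this sketch, `ratio` plays the role of the share of code-switched tokens: `ratio=0.0` leaves the text unchanged, while `ratio=1.0` replaces every token that has a lexicon entry. Tokens without an entry (here, "France") always stay in the source language.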