Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust towards the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers for cross-lingual and multilingual retrieval.
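To make the idea of lexicon-based code-switching concrete, here is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization, a toy English-German lexicon invented for illustration, and a `ratio` parameter defined as the fraction of lexicon-covered tokens to replace; the function name `code_switch` and all of these choices are hypothetical.

```python
import random


def code_switch(text, lexicon, ratio=0.5, seed=None):
    """Replace a fraction of tokens with their lexicon translations.

    Assumptions (not taken from the paper): tokens are split on whitespace,
    lookups are lowercased, and `ratio` is relative to the number of tokens
    that actually have a lexicon entry.
    """
    rng = random.Random(seed)
    tokens = text.split()
    # Positions of tokens for which the bilingual lexicon has a translation.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    n_switch = round(ratio * len(candidates))
    # Sample positions without replacement and substitute the translation.
    for i in rng.sample(candidates, n_switch):
        tokens[i] = lexicon[tokens[i].lower()]
    return " ".join(tokens)


# Toy English-to-German lexicon, invented for illustration only.
toy_lexicon = {"capital": "Hauptstadt", "city": "Stadt", "france": "Frankreich"}
query = "what is the capital city of france"

print(code_switch(query, toy_lexicon, ratio=0.0))          # query unchanged
print(code_switch(query, toy_lexicon, ratio=1.0, seed=0))  # all covered tokens switched
print(code_switch(query, toy_lexicon, ratio=0.5, seed=0))  # a random subset switched
```

In a training setup along the lines the abstract describes, such a function would be applied to the queries and/or passages of the English training data before fine-tuning the ranker, with the lexicon induced from cross-lingual word embeddings or parallel Wikipedia page titles rather than written by hand.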

