Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross-Language IR (CLIR), where queries are expressed in one language and documents in another, the multilingual IR (MLIR) task of creating a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that although combining neural document translation with neural ranking yields the best Mean Average Precision (MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing time by using a pretrained XLM-R multilingual language model to index documents in their native language; the 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is fine-tuning XLM-R using mixed-language batches from neural translations of MS MARCO passages.
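The abstract does not spell out how mixed-language batches are built, so the following is only a minimal sketch of one plausible reading: training examples from all languages are pooled and shuffled, so that a batch is not restricted to a single language. The function name `mixed_language_batches`, the language codes, and the toy passages are all hypothetical placeholders, not taken from the paper.

```python
import random


def mixed_language_batches(examples_by_lang, batch_size, seed=0):
    """Yield batches drawn from a shuffled pool of all languages.

    examples_by_lang is a hypothetical dict mapping a language code to a
    list of training examples (e.g. translated MS MARCO passages).
    This is an illustrative assumption, not the authors' procedure.
    """
    rng = random.Random(seed)
    # Tag every example with its language and pool them together.
    pool = [(lang, ex) for lang, exs in examples_by_lang.items() for ex in exs]
    rng.shuffle(pool)
    # Slice the shuffled pool into fixed-size batches, dropping any ragged tail.
    for start in range(0, len(pool) - batch_size + 1, batch_size):
        yield pool[start:start + batch_size]


# Toy data: placeholder language codes with 8 dummy passages each.
toy = {
    lang: [f"{lang}-passage-{i}" for i in range(8)]
    for lang in ("de", "fr", "it")
}
batches = list(mixed_language_batches(toy, batch_size=6))

for batch in batches:
    print(sorted({lang for lang, _ in batch}))
```

The contrasting design would be single-language batches, where each batch is sliced from one language's list only; the pooled shuffle above is the simplest way to avoid that, at the cost of not controlling the exact language proportions within a batch.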