Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, a requirement that is even harder to meet in multilingual scenarios. This paper introduces UMR, an Unsupervised Multilingual dense Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework that iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. Our source code, data, and models are publicly available at https://github.com/MiuLab/UMR.
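The pseudo-labeling idea can be illustrated with a minimal sketch. It is not the released UMR implementation, and it rests on several assumptions made only for illustration: the language model is replaced by a toy smoothed unigram scorer (`toy_log_likelihood`, a hypothetical stand-in for a multilingual language model), the likelihood is taken to be that of the query given the passage, and the soft labels are compared with retriever scores using a KL divergence. None of these choices is specified in the abstract.

```python
import math
from collections import Counter


def toy_log_likelihood(query: str, passage: str) -> float:
    """Hypothetical stand-in for a language model's log p(query | passage).

    Assumption: a unigram model with add-one smoothing built from the passage.
    A real system would sum token log-probabilities from a pretrained
    multilingual language model instead.
    """
    passage_tokens = passage.lower().split()
    query_tokens = query.lower().split()
    counts = Counter(passage_tokens)
    vocab = set(passage_tokens) | set(query_tokens)
    total = len(passage_tokens) + len(vocab)  # add-one smoothing denominator
    return sum(math.log((counts[t] + 1) / total) for t in query_tokens)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def pseudo_labels(query, passages):
    """Turn likelihood scores into a soft label distribution over passages."""
    return softmax([toy_log_likelihood(query, p) for p in passages])


def kl_divergence(target, predicted):
    """KL(target || predicted); one possible loss for fitting a retriever."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Usage: no paired (query, relevant passage) annotation is needed.
query = "capital of france"
passages = [
    "paris is the capital of france",
    "the cat sat on the mat",
    "berlin is the capital of germany",
]
labels = pseudo_labels(query, passages)
best = max(range(len(passages)), key=labels.__getitem__)

# Pretend an untrained retriever scores all passages equally.
retriever_probs = softmax([0.0, 0.0, 0.0])
loss = kl_divergence(labels, retriever_probs)
```

In this sketch the passage that makes the query most likely receives the largest pseudo label, and the loss shrinks to zero as the retriever's distribution approaches the pseudo-label distribution. An iterative framework could alternate between labeling with the current candidates and retraining the retriever, though the exact schedule is not described in the abstract.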