Recent multilingual pre-trained models have shown improved performance on a variety of multilingual tasks. However, these models perform poorly on multilingual retrieval tasks because multilingual training data is scarce. In this paper, we propose to mine and generate self-supervised training data from a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to mine the relevance between unlabeled queries and passages. We also introduce a query generator that produces additional queries in the target languages for unlabeled passages. Through extensive experiments on the Mr. TYDI dataset and an industrial dataset from a commercial search engine, we demonstrate that our method outperforms baselines built on various pre-trained multilingual models. On the industrial dataset, our method even achieves performance on par with the supervised method.
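
The abstract does not say how the sparse and dense signals are combined, so the following is only a minimal sketch of one plausible scheme, not the method of the paper. It assumes that relevance scores from a sparse retriever (for example BM25) and a dense retriever (for example embedding cosine similarity) have already been computed, and it keeps a query-passage pair as a pseudo-positive only when both retrievers rank the passage in their top k. The function names `top_k` and `mine_pairs`, the agreement rule, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Scores are given as {query_id: {passage_id: score}}.
ScoreTable = Dict[str, Dict[str, float]]


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k highest-scoring passages (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pid for pid, _ in ranked[:k]]


def mine_pairs(sparse: ScoreTable, dense: ScoreTable, k: int = 2) -> List[Tuple[str, str]]:
    """Keep (query, passage) pairs that BOTH retrievers place in their top k.

    Assumption: agreement between the two retrievers is used as a proxy for
    relevance. This rule is illustrative and is not taken from the paper.
    """
    pairs: List[Tuple[str, str]] = []
    for qid, sparse_scores in sparse.items():
        if qid not in dense:
            continue  # no dense scores for this query, so it cannot be confirmed
        dense_top = set(top_k(dense[qid], k))
        for pid in top_k(sparse_scores, k):
            if pid in dense_top:
                pairs.append((qid, pid))
    return pairs


if __name__ == "__main__":
    # Toy, made-up scores for two queries over three passages.
    sparse_scores = {
        "q1": {"p1": 7.2, "p2": 3.1, "p3": 0.4},
        "q2": {"p1": 0.2, "p2": 1.5, "p3": 1.4},
    }
    dense_scores = {
        "q1": {"p1": 0.91, "p2": 0.40, "p3": 0.55},
        "q2": {"p1": 0.70, "p2": 0.65, "p3": 0.10},
    }
    print(mine_pairs(sparse_scores, dense_scores, k=2))
```

Requiring agreement favors precision over recall. A weighted sum of normalized scores or a rank-fusion rule would be equally reasonable alternatives under the same assumptions.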