State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages. We publicly release our code and models for the community.