Multilingual sentence representations are the foundation for similarity-based bitext mining, which is crucial for scaling multilingual neural machine translation (NMT) systems to more languages. In this paper, we introduce MuSR: a one-for-all Multilingual Sentence Representation model that supports more than 220 languages. Leveraging English-centric parallel corpora with billions of sentence pairs, we train a multilingual Transformer encoder, coupled with an auxiliary Transformer decoder, by adopting a multilingual NMT framework with CrossConST, a cross-lingual consistency regularization technique proposed by Gao et al. (2023). Experimental results on multilingual similarity search and bitext mining tasks show the effectiveness of our approach. Specifically, a single MuSR model outperforms LASER3 (Heffernan et al., 2022), which consists of 148 independent multilingual sentence encoders.
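To make the notion of similarity-based bitext mining concrete, the following is a minimal sketch, not the MuSR pipeline itself. It assumes sentence embeddings have already been produced by some multilingual encoder, and it pairs each source sentence with its most similar target sentence by cosine similarity. The function names `cosine` and `mine_bitext`, the toy vectors, and the `threshold` value of 0.8 are all hypothetical choices made for illustration. The plain cosine threshold is a simplification and is not claimed to be the scoring used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_bitext(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source embedding with its nearest target embedding.

    A pair (source index, target index, score) is kept only if the
    cosine similarity reaches the (hypothetical) threshold.
    """
    pairs = []
    for i, s in enumerate(src_vecs):
        scores = [cosine(s, t) for t in tgt_vecs]
        # Index of the most similar target sentence.
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy 3-dimensional "embeddings"; real sentence embeddings would come
# from a trained multilingual encoder and have far more dimensions.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

mined = mine_bitext(src, tgt)
for i, j, score in mined:
    print(f"source {i} <-> target {j} (cosine = {score:.3f})")
```

In this toy setting the third source vector has no sufficiently similar target, so it is left unpaired. Filtering by a score threshold is what lets a mining procedure discard sentences that have no translation in the other corpus.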