Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for a range of cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., finding semantically similar sentences that can serve as translations of each other). We validate our approach on multilingual similarity search and corpus filtering tasks. Experiments across several low-resource languages show that our method greatly outperforms previous sentence encoders such as LASER, LASER3, and LaBSE.
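
To make the two evaluation tasks concrete, the following is a minimal sketch of how sentence embeddings could be used for similarity search and corpus filtering. It assumes cosine similarity as the scoring function and a fixed threshold for filtering; neither choice is stated in the abstract. The function names (`cosine`, `nearest_neighbor`, `filter_pairs`), the threshold value, and the toy vectors are hypothetical stand-ins for real encoder output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def filter_pairs(src_embs, tgt_embs, threshold):
    """Keep indices of aligned pairs whose similarity reaches the threshold."""
    return [
        i
        for i, (s, t) in enumerate(zip(src_embs, tgt_embs))
        if cosine(s, t) >= threshold
    ]


# Toy embeddings (hypothetical): rows with the same index are candidate
# translation pairs; the third pair is deliberately a poor match.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.7, 0.7, 0.1]]

# Similarity search: for each source sentence, find the closest target.
matches = [nearest_neighbor(s, tgt)[0] for s in src]

# Corpus filtering: drop aligned pairs that score below the threshold.
kept = filter_pairs(src, tgt, threshold=0.8)
```

In this toy setup, similarity search retrieves the aligned target for every source sentence, while filtering discards the third pair because its score falls well below the assumed threshold. A real pipeline would replace the toy vectors with embeddings from a multilingual sentence encoder and would likely use an approximate nearest-neighbor index rather than a linear scan.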