Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, significantly improve performance on cross-lingual downstream tasks. However, their reliance on large amounts of training data or inefficient model architectures makes it computationally expensive to train a new model for specific languages and domains. To address this issue, we introduce an efficient and effective massively multilingual sentence embedding model (EMS), trained with cross-lingual token-level reconstruction (XTR) and sentence-level contrastive learning as objectives. Compared with related studies, the proposed model can be trained efficiently using significantly fewer parallel sentences and GPU resources. Empirical results show that the proposed model yields better or comparable results on cross-lingual sentence retrieval, zero-shot cross-lingual genre classification, and sentiment classification. Ablative analyses demonstrate the efficiency and effectiveness of each component of the proposed model. We release the training code and the EMS pre-trained sentence embedding model, which supports 62 languages ( https://github.com/Mao-KU/EMS ).