Expressing universal semantics common to all languages is helpful in understanding the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic ``universals'' for any two languages. In this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm, to learn (X)Cross-lingual universals with the aid of excessive multilingual non-parallel data. EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each other until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of EMMA-X over advanced models.
翻译:表达所有语言共有的通用语义有助于理解复杂且具有文化特异性的句子含义。此场景下的研究主题聚焦于利用大规模平行语料库学习跨语言的通用表征。然而,由于平行数据的稀疏性和稀缺性,为任意两种语言学习真正的“语言共性”仍面临重大挑战。本文提出EMMA-X:一种基于EM的多语言预训练算法,借助海量非平行多语言数据学习跨语言共性。EMMA-X在EM框架内统一了跨语言表征学习任务与额外的语义关系预测任务。额外的语义分类器与跨语言句子编码器共同逼近两个句子的语义关系,并相互监督直至收敛。为评估EMMA-X,我们在新引入的基准测试XRETE上开展实验,该基准包含12项完全依赖句子级表征的广泛研究的跨语言任务。结果表明EMMA-X实现了最先进的性能。进一步基于三项要求对构建的表征空间进行几何分析,证明了EMMA-X相较于先进模型的优越性。