Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, even though such signals are not explicitly provided during pretraining. It remains an open question whether mPLMs can be used to measure language similarity, and whether the resulting similarities can then be used to select source languages that boost cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results for them. Additionally, we find that similarity results vary across different mPLMs and across different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim selects better source languages than linguistic measures do, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
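To make the idea of inducing similarity from a multi-parallel corpus concrete, the following is a minimal sketch and not the exact mPLM-Sim procedure. It assumes that each language is represented by one fixed-size embedding per sentence, taken from some layer of an mPLM, and that row i of every matrix corresponds to the same parallel sentence. The use of cosine similarity averaged over sentences, the function names, and the placeholder language codes `lang_a`, `lang_b`, and `lang_c` are all illustrative assumptions; random vectors stand in for real model representations.

```python
import numpy as np


def language_similarity(emb_a, emb_b):
    """Mean cosine similarity between row-aligned sentence embeddings.

    Assumption: emb_a[i] and emb_b[i] encode the same parallel sentence.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))


def similarity_matrix(embeddings):
    """Build a symmetric language-by-language similarity matrix."""
    langs = sorted(embeddings)
    sim = np.eye(len(langs))
    for i, lang_i in enumerate(langs):
        for j in range(i + 1, len(langs)):
            score = language_similarity(embeddings[lang_i], embeddings[langs[j]])
            sim[i, j] = sim[j, i] = score
    return langs, sim


def rank_sources(target, candidates, langs, sim):
    """Order candidate source languages by similarity to the target."""
    t = langs.index(target)
    return sorted(candidates, key=lambda c: sim[t, langs.index(c)], reverse=True)


# Toy stand-in for mPLM sentence embeddings: 50 parallel sentences, 16 dims.
# lang_b is a lightly perturbed copy of lang_a; lang_c is heavily perturbed.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 16))
embeddings = {
    "lang_a": base,
    "lang_b": base + 0.1 * rng.normal(size=base.shape),
    "lang_c": base + 2.0 * rng.normal(size=base.shape),
}

langs, sim = similarity_matrix(embeddings)
ranked = rank_sources("lang_a", ["lang_b", "lang_c"], langs, sim)
print(ranked)  # lang_b ranks ahead of lang_c
```

Under these assumptions, choosing a source language for a given target reduces to taking the top-ranked candidate. Since the similarity results vary across mPLMs and across layers, the same computation would be repeated with embeddings drawn from each model and layer of interest.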