Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals that are not explicitly provided during pretraining. It remains an open question whether mPLMs can be used to measure language similarity, and whether the resulting similarities can then guide the choice of source languages to improve cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim correlates moderately highly with linguistic similarity measures based on lexicostatistics, genealogical language family, and geographical sprachbund. In a case study of languages for which this correlation is low, we observe that mPLM-Sim yields the more accurate similarity results. We also find that similarity results vary across mPLMs and across layers within a single mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer through experiments on both low-level syntactic tasks and high-level semantic tasks. The results demonstrate that mPLM-Sim selects better source languages than linguistic measures do, yielding a 1%-2% improvement in zero-shot cross-lingual transfer performance.
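The idea of inducing language similarity from a multi-parallel corpus and then using it to pick a source language can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything specific here is an assumption: sentence embeddings are taken as given (in practice they might be pooled hidden states from one layer of an mPLM), similarity is the mean cosine similarity over parallel sentence pairs, and the function names `language_similarity` and `select_source` as well as the language codes `xx`, `yy`, `zz` are hypothetical. Random vectors stand in for real model outputs.

```python
import numpy as np


def language_similarity(embeddings):
    """Pairwise language similarity from aligned sentence embeddings.

    embeddings: dict mapping a language code to an (n_sentences, dim) array.
    Row i must be the same sentence in every language (multi-parallel data).
    Assumption: similarity = mean cosine similarity over parallel sentences.
    """
    langs = sorted(embeddings)
    # L2-normalise each sentence vector so a dot product equals cosine similarity.
    unit = {}
    for lang in langs:
        x = np.asarray(embeddings[lang], dtype=float)
        unit[lang] = x / np.linalg.norm(x, axis=1, keepdims=True)

    sim = np.zeros((len(langs), len(langs)))
    for i, a in enumerate(langs):
        for j, b in enumerate(langs):
            # Cosine of each parallel sentence pair, averaged over the corpus.
            sim[i, j] = np.mean(np.sum(unit[a] * unit[b], axis=1))
    return langs, sim


def select_source(target, candidates, langs, sim):
    """Return the candidate source language most similar to the target."""
    t = langs.index(target)
    return max(candidates, key=lambda c: sim[langs.index(c), t])


# Toy stand-in for mPLM embeddings: "yy" is a noisy copy of "xx",
# while "zz" is unrelated random data.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 16))
toy = {
    "xx": base,
    "yy": base + 0.1 * rng.normal(size=(50, 16)),
    "zz": rng.normal(size=(50, 16)),
}

langs, sim = language_similarity(toy)
best = select_source("yy", ["xx", "zz"], langs, sim)
print(langs)
print(np.round(sim, 3))
print("best source for yy:", best)
```

Repeating the computation with embeddings taken from a different layer or a different model would give a different matrix, which is one way to observe the layer and model dependence mentioned above.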