When we transfer a pretrained language model to a new language, many axes of variation change at once. To disentangle the impact of different factors, such as syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of cross-lingual variation at a time, and then measure the resulting drops in a pretrained model's downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. Moreover, good-quality tokenizers in the transfer language do not make vocabulary alignment easier. Our experiments provide insights into which factors of cross-lingual transfer researchers should focus on most when designing language transfer scenarios.