The recent increase in data and model scale for language model pre-training has led to substantial training costs. In scenarios where new data becomes available over time, updating a model instead of fully retraining it would therefore yield significant computational savings. We study the pros and cons of updating a language model when the new data comes from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and the characteristics of the languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of the new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.