As multilingual machine translation (MMT) models continue to grow in size and in the number of languages they support, it is natural to reuse and upgrade existing models to save computation as data becomes available in more languages. However, adding new languages requires updating the vocabulary, which complicates the reuse of embeddings. The question of how to reuse existing models while also making architectural changes that provide capacity for both old and new languages has not been closely studied either. In this work, we introduce three techniques that speed up learning of the new languages and alleviate catastrophic forgetting despite vocabulary and architecture mismatches. Our results show that by (1) carefully initializing the network, (2) applying learning rate scaling, and (3) performing data up-sampling, it is possible to exceed the performance of a same-sized baseline model using only 30% of the computation, and to recover the performance of a larger model trained from scratch with a reduction in computation of over 50%. Furthermore, our analysis reveals that the introduced techniques help the model learn the new translation directions more effectively while alleviating catastrophic forgetting at the same time. We hope our work will guide research into more efficient approaches to adding languages to MMT models and ultimately maximize the reuse of existing models.
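To make the embedding-reuse problem concrete, the following is a minimal sketch of one common way to initialize an embedding table after a vocabulary update: rows for tokens present in both vocabularies are copied, and rows for new tokens are drawn at random. This is an illustration under stated assumptions, not necessarily the initialization used in this work; the function name `init_expanded_embeddings`, the dictionary-based vocabulary format, and the `std` value are all hypothetical.

```python
import random


def init_expanded_embeddings(old_vocab, old_emb, new_vocab, dim, seed=0, std=0.02):
    """Build an embedding table for new_vocab, reusing rows from old_emb.

    old_vocab / new_vocab: dict mapping token -> row index (assumed format).
    old_emb: list of rows, each a list of `dim` floats.
    Returns the new table and the number of rows copied from the old one.
    """
    rng = random.Random(seed)
    new_emb = [None] * len(new_vocab)
    reused = 0
    for token, new_idx in new_vocab.items():
        old_idx = old_vocab.get(token)
        if old_idx is not None:
            # Token exists in both vocabularies: copy its trained embedding,
            # even if its row index has changed.
            new_emb[new_idx] = list(old_emb[old_idx])
            reused += 1
        else:
            # Unseen token: small random initialization (std is an assumption).
            new_emb[new_idx] = [rng.gauss(0.0, std) for _ in range(dim)]
    return new_emb, reused


# Toy example: "bonjour" is new, and the shared tokens change position.
old_vocab = {"<pad>": 0, "hello": 1, "world": 2}
old_emb = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
new_vocab = {"<pad>": 0, "world": 1, "hello": 2, "bonjour": 3}

new_emb, reused = init_expanded_embeddings(old_vocab, old_emb, new_vocab, dim=2)
```

In this toy setup, the three shared tokens keep their trained vectors under their new indices, and only the row for `bonjour` starts from random values.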