Standard fine-tuning of language models typically performs well on in-distribution data but generalizes poorly under distribution shifts. In this work, we aim to improve the generalization of adapter-based cross-lingual task transfer, a setting in which such cross-language distribution shifts are inherent. We investigate scheduled unfreezing algorithms, originally proposed to mitigate catastrophic forgetting in transfer learning, for fine-tuning task adapters. Our experiments show that scheduled unfreezing methods close the gap to full fine-tuning and achieve stronger cross-lingual transfer performance, suggesting that these methods do more than mitigate catastrophic forgetting. To understand these empirical findings, we then study the learning dynamics of scheduled unfreezing using Fisher Information. Our experiments reveal that scheduled unfreezing induces learning dynamics different from those of standard fine-tuning, and they provide evidence that the dynamics of Fisher Information during training correlate with cross-lingual generalization performance. Finally, we propose a general scheduled unfreezing algorithm that achieves an average improvement of 2 points across four datasets over standard fine-tuning and provides empirical support for a theory-based justification of the heuristic unfreezing schedule for adapter training.
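To make the gradual-unfreezing heuristic concrete, the following is a minimal sketch in PyTorch, not the paper's specific algorithm: adapters are frozen at the start of training and unfrozen one layer at a time from the top of the network down. The model class, the top-down order, and the `unfreeze_interval` parameter are illustrative assumptions.

```python
# Hedged sketch of scheduled (gradual) unfreezing for adapter layers.
# ToyAdapterModel and unfreeze_interval are hypothetical, for illustration only.
import torch
import torch.nn as nn

class ToyAdapterModel(nn.Module):
    """Stand-in for a Transformer with one bottleneck adapter per layer."""
    def __init__(self, num_layers=4, hidden=16):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4), nn.ReLU(), nn.Linear(4, hidden))
            for _ in range(num_layers)
        )
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, x):
        for adapter in self.adapters:
            x = x + adapter(x)  # residual bottleneck adapter
        return self.classifier(x)

model = ToyAdapterModel()

# Start with every adapter frozen; only the task head is trainable.
for adapter in model.adapters:
    for p in adapter.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
unfreeze_interval = 100  # unfreeze one additional adapter every k steps

for step in range(500):
    # Top-down schedule: the adapter closest to the output unfreezes first.
    layers_open = min(step // unfreeze_interval, len(model.adapters))
    for i, adapter in enumerate(reversed(model.adapters)):
        for p in adapter.parameters():
            p.requires_grad = i < layers_open

    x = torch.randn(8, 16)              # dummy batch
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Standard fine-tuning corresponds to setting `layers_open = len(model.adapters)` from step 0; the schedule only changes which parameters receive gradients at each step.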
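The abstract also refers to tracking Fisher Information during training. Below is a hedged sketch of one common empirical estimator, the diagonal Fisher Information computed from gradients of the log-likelihood with labels sampled from the model's own predictive distribution; the paper's exact estimator and normalization may differ, and the function name `diagonal_fisher` is an assumption.

```python
# Hedged sketch: per-parameter diagonal Fisher Information estimate on one batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def diagonal_fisher(model: nn.Module, inputs: torch.Tensor) -> dict:
    """Crude per-batch estimate of diag(F) = E[(d log p(y|x)/d theta)^2],
    sampling y from the model's predictive distribution."""
    named = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    logits = model(inputs)
    sampled = torch.distributions.Categorical(logits=logits).sample()
    nll = F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(nll, [p for _, p in named])
    return {n: g.detach() ** 2 for (n, _), g in zip(named, grads)}

# Usage on a toy classifier: summing the diagonal gives a scalar that can be
# logged over training steps to track how Fisher Information evolves.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
fisher = diagonal_fisher(model, torch.randn(8, 16))
total_info = sum(f.sum().item() for f in fisher.values())
```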