The pretraining-fine-tuning paradigm has become the de facto strategy for transfer learning in modern language modeling. Given that task adaptation in LMs is often governed by parameters shared across tasks, we argue that a more surgical approach to regularization is needed for smoother transfer learning. To this end, we investigate how these task-sensitive parameters shape the pretraining loss landscape through an information-theoretic lens. We then leverage these findings to devise a novel dropout scheme for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task- and architecture-agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we show that our approach to regularization yields consistently better performance than standard baselines, even under data scarcity.
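To make the general idea of importance-guided regularization concrete, the following is a minimal, purely illustrative sketch (not the paper's actual method): it assumes PyTorch, a precomputed per-unit importance score (e.g., a diagonal Fisher estimate from pretraining), and a hypothetical `GuidedDropout` module whose per-unit drop probability decreases with that importance.

```python
# Illustrative sketch only. The choice of a diagonal Fisher-style importance
# score and per-unit drop rates is an assumption, not the method described
# in the abstract.
import torch
import torch.nn as nn


class GuidedDropout(nn.Module):
    """Dropout whose per-unit drop probability is modulated by an
    importance score, so task-sensitive units are dropped less often."""

    def __init__(self, importance: torch.Tensor, base_p: float = 0.1):
        super().__init__()
        # Normalize importance scores to [0, 1].
        imp = (importance - importance.min()) / (importance.max() - importance.min() + 1e-8)
        # Higher importance -> lower drop probability.
        self.register_buffer("p_drop", base_p * (1.0 - imp))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x
        keep = 1.0 - self.p_drop
        mask = torch.bernoulli(keep.expand_as(x))
        # Inverted-dropout scaling keeps activations unbiased in expectation.
        return x * mask / keep
```

In such a scheme, units deemed important for transfer are retained more often while the remainder are still regularized, which incurs no extra cost at fine-tuning time beyond a standard dropout mask.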