Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution inputs. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e., the reference model), selecting a subset of important training instances based on the training dynamics, and fine-tuning again only on these selected examples (i.e., the main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that (1) training dynamics are highly transferable across model sizes and pre-training methods, and that (2) fine-tuning main models on these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these observations, we propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with dataset cartography, FTFT uses more efficient reference models and aggressive early stopping. FTFT achieves robustness improvements over ERM while lowering the training cost by up to $\sim 50\%$.
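The dual-model recipe above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name, the selection fraction, and the heuristic of ranking examples by the variability of the gold-label probability across epochs (a common choice in dataset cartography) are all assumptions.

```python
# Hypothetical sketch of instance selection from training dynamics.
# Assumption: we rank examples by the variability (std across epochs)
# of the reference model's gold-label probability, and keep the most
# "ambiguous" ones for fine-tuning the main model.
import numpy as np

def select_by_training_dynamics(gold_probs, fraction=0.33):
    """gold_probs: array of shape (num_epochs, num_examples), the
    reference model's probability of the gold label after each epoch.
    Returns (selected indices, per-example confidence, variability)."""
    confidence = gold_probs.mean(axis=0)   # mean gold-label probability
    variability = gold_probs.std(axis=0)   # spread across epochs
    k = int(fraction * gold_probs.shape[1])
    # keep the examples the reference model is least certain about
    return np.argsort(-variability)[:k], confidence, variability

# Toy usage: 4 epochs, 6 training examples
probs = np.array([
    [0.90, 0.20, 0.5, 0.95, 0.4, 0.70],
    [0.92, 0.30, 0.7, 0.96, 0.2, 0.75],
    [0.95, 0.25, 0.4, 0.97, 0.6, 0.80],
    [0.93, 0.35, 0.8, 0.98, 0.3, 0.85],
])
idx, conf, var = select_by_training_dynamics(probs, fraction=0.5)
```

In FTFT, `gold_probs` would come from a cheaper reference model than the main model, which is what removes most of the duplicated fine-tuning cost.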