Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, or GPT) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from instability: tuning the same model under the same setting results in significantly different performance. Many methods have been proposed to address this problem, but there is no theoretical understanding of why and how they work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely full fine-tuning and head tuning. We define stability under each setting and prove the corresponding stability bounds. These bounds explain why and how several existing methods stabilize the fine-tuning procedure. Beyond explaining most of the observed empirical findings, our analysis framework can also guide the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure: Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate the proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experimental results show that our methods significantly stabilize the fine-tuning procedure and corroborate our theoretical analysis.