Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale optimization problem in which the choice of training algorithm critically determines how well the trained model generalizes to unseen test data, especially in few-shot learning. To achieve good generalization and avoid overfitting, techniques such as data augmentation and pruning are often applied. However, adding these regularization techniques requires heavy tuning of the hyperparameters of the optimization algorithm, such as the popular Adam optimizer. In this paper, we propose a two-stage fine-tuning method, PAC-tuning, to address this optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly minimizes the PAC-Bayes generalization bound to learn a suitable parameter distribution. Second, PAC-tuning modifies the gradient by injecting noise, with the variance learned in the first stage, into the model parameters during training, resulting in a variant of perturbed gradient descent (PGD). In the past, the few-shot scenario posed difficulties for PAC-Bayes training because the PAC-Bayes bound, when applied to large models with limited training data, may not be tight. Our experimental results on five GLUE benchmark tasks demonstrate that PAC-tuning successfully handles the challenges of fine-tuning and outperforms strong baseline methods by a visible margin, further confirming the potential of applying PAC training in other settings where the Adam optimizer is currently used.
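The two-stage structure described above can be illustrated with a minimal sketch on a toy one-dimensional regression problem. Everything here is an assumption made for illustration: the function names, the toy data, the Gaussian prior and posterior, and in particular the stage-1 objective, which is a simplified stand-in (expected loss under noise plus a KL penalty divided by the sample size) and not the PAC-Bayes bound used in the paper. It is not the authors' implementation and does not involve a PLM or Adam.

```python
import math
import random


def grad_loss(w, data):
    """Gradient of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def stage1_learn_noise(data, w0, prior_std=1.0, lr=0.05, steps=500, seed=0):
    """Stage 1 (simplified): jointly learn the mean w and the noise std sigma.

    Assumed stand-in objective, NOT the paper's bound:
        E_eps[loss(w + sigma * eps)] + KL(N(w, sigma^2) || N(w0, prior_std^2)) / n
    """
    rng = random.Random(seed)
    n = len(data)
    w, rho = w0, math.log(prior_std)  # sigma = exp(rho) stays positive
    for _ in range(steps):
        sigma = math.exp(rho)
        eps = rng.gauss(0.0, 1.0)
        # Loss gradient at a reparameterized sample from the posterior.
        g = grad_loss(w + sigma * eps, data)
        # Gradients of the Gaussian KL term with respect to w and rho.
        kl_w = (w - w0) / prior_std**2
        kl_rho = sigma**2 / prior_std**2 - 1.0
        w -= lr * (g + kl_w / n)
        rho -= lr * (g * eps * sigma + kl_rho / n)
    return w, math.exp(rho)


def stage2_perturbed_gd(data, w, sigma, lr=0.05, steps=500, seed=1):
    """Stage 2: gradient descent with the gradient evaluated at a noisy point.

    The noise std sigma is fixed to the value learned in stage 1.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        noisy_w = w + sigma * rng.gauss(0.0, 1.0)
        w -= lr * grad_loss(noisy_w, data)
    return w


# Hypothetical few-shot data generated by y = 2 * x.
data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (2.0, 4.0)]
w1, sigma = stage1_learn_noise(data, w0=0.0)
w2 = stage2_perturbed_gd(data, w1, sigma)
print(f"learned noise std: {sigma:.3f}, final weight: {w2:.3f}")
```

In this sketch the KL penalty pulls the stage-1 weight toward the prior mean `w0`, while stage 2 drops that penalty and keeps only the noise injection, so the weight settles near the least-squares solution with fluctuations whose scale is set by the learned `sigma`.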