Fine-tuning pre-trained language models (LMs) has become the standard way to solve NLP tasks, especially in low-data settings. Yet there is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model for studying the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of the NTK on computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained LMs. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
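For readers unfamiliar with the kernel in question, the following is a minimal sketch of the standard (gradient-descent) empirical NTK, $K(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle$, evaluated at a fixed parameter setting $\theta_0$. It is not the Adam extension described above, and it is not taken from the paper's code. The toy two-layer tanh network and the names `init_params`, `forward`, `grad`, and `empirical_ntk` are assumptions made purely for illustration; in the fine-tuning setting, $\theta_0$ would be the pre-trained weights rather than a random initialization.

```python
import math
import random


def init_params(d_in, width, seed=0):
    """Random parameters for a toy model f(x) = sum_j a_j * tanh(w_j . x).

    Hypothetical stand-in for theta_0; a pre-trained LM would supply
    its own weights here.
    """
    rng = random.Random(seed)
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(d_in)) for _ in range(d_in)]
         for _ in range(width)]
    a = [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
    return w, a


def forward(params, x):
    """Scalar model output f(x; theta)."""
    w, a = params
    return sum(
        a_j * math.tanh(sum(w_jk * x_k for w_jk, x_k in zip(w_j, x)))
        for w_j, a_j in zip(w, a)
    )


def grad(params, x):
    """Flattened gradient of f(x; theta) with respect to all parameters.

    Layout: all first-layer weights (row by row), then the output weights.
    """
    w, a = params
    g_w, g_a = [], []
    for w_j, a_j in zip(w, a):
        h = math.tanh(sum(w_jk * x_k for w_jk, x_k in zip(w_j, x)))
        g_a.append(h)                                        # df/da_j
        g_w.extend(a_j * (1.0 - h * h) * x_k for x_k in x)   # df/dw_jk
    return g_w + g_a


def empirical_ntk(params, xs):
    """Gram matrix K[i][j] = <grad f(x_i), grad f(x_j)> at fixed params."""
    grads = [grad(params, x) for x in xs]
    return [[sum(p * q for p, q in zip(gi, gj)) for gj in grads]
            for gi in grads]


if __name__ == "__main__":
    theta0 = init_params(d_in=3, width=8, seed=1)
    inputs = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
    for row in empirical_ntk(theta0, inputs):
        print(row)
```

If training stays in the kernel regime, the model behaves like its first-order expansion around $\theta_0$, so predictions depend on the training data only through this Gram matrix. The question posed above is whether that approximation holds when $\theta_0$ is a pre-trained LM.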