Fine-tuning pre-trained language models (LMs) has become the standard way to solve NLP tasks, especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model for studying the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of the NTK on computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.