Fine-tuning a pre-trained language model (PLM) has emerged as the predominant strategy in many natural language processing applications. However, both fine-tuning a PLM and running inference with it remain expensive, especially on edge devices with low computing power. General approaches such as quantization and distillation have been widely studied to reduce the compute and memory cost of PLM fine-tuning, but very few one-shot compression techniques have been explored. In this paper, we investigate the neural tangent kernel (NTK) of the multilayer perceptron (MLP) modules in a PLM, since the NTK reveals the gradient descent dynamics of neural networks, and propose to construct a lightweight PLM through NTK-approximating MLP fusion. To achieve this, we reconsider the MLP as a bundle of sub-MLPs and cluster them into a given number of centroids. The centroids can then be restored as a compressed MLP, which, surprisingly, is shown to approximate the NTK of the original PLM well. Extensive experiments on PLM fine-tuning for both natural language understanding (NLU) and natural language generation (NLG) tasks verify the effectiveness of the proposed method, MLP fusion. Our code is available at https://github.com/weitianxin/MLP_Fusion.
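To make the clustering idea concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a two-layer MLP of the form y = W2 · relu(W1 · x + b1) + b2, treats each hidden unit (one row of W1, one entry of b1, one column of W2) as a sub-MLP, and clusters these units with a plain k-means. The function names `kmeans`, `fuse_mlp`, and `mlp`, the ReLU activation, the farthest-point initialisation, and the rescaling of output weights by cluster size are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.stack(centers).astype(float)

    def assign(c):
        dists = ((points[:, None, :] - c[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return centers, assign(centers)


def fuse_mlp(W1, b1, W2, k):
    """Cluster the h hidden units (sub-MLPs) into k centroids.

    W1: (h, d_in), b1: (h,), W2: (d_out, h).
    Returns a compressed (W1_c, b1_c, W2_c) with k hidden units.
    """
    d_in = W1.shape[1]
    # One row per sub-MLP: [input weights | bias | output weights].
    sub_mlps = np.concatenate([W1, b1[:, None], W2.T], axis=1)
    centers, labels = kmeans(sub_mlps, k)
    counts = np.bincount(labels, minlength=k).astype(float)
    W1_c = centers[:, :d_in]
    b1_c = centers[:, d_in]
    # Assumption: scale each centroid's output weights by its cluster size so
    # that the sum over the original hidden units is preserved.
    W2_c = centers[:, d_in + 1:].T * counts
    return W1_c, b1_c, W2_c


def mlp(x, W1, b1, W2, b2):
    """Forward pass of the assumed two-layer ReLU MLP for a batch x: (n, d_in)."""
    return np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2
```

Under these assumptions, calling `fuse_mlp(W1, b1, W2, k)` on a layer with `h` hidden units yields a layer with `k` hidden units that can be passed to the same `mlp` function. When several hidden units are exact duplicates, the fused layer reproduces the original output exactly; in general it is only an approximation, and this sketch makes no claim about how closely it matches the NTK.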