Fine-tuning a pre-trained language model (PLM) has emerged as the predominant strategy in many natural language processing applications. However, both fine-tuning PLMs and running inference with them are expensive, especially on edge devices with low computing power. General approaches such as quantization and distillation have been widely studied to reduce the compute and memory cost of PLM fine-tuning, while one-shot compression techniques remain largely unexplored. In this paper, we investigate the neural tangent kernel (NTK), which reveals the gradient descent dynamics of neural networks, of the multilayer perceptron (MLP) modules in a PLM, and propose to construct a lightweight PLM through NTK-approximating MLP fusion. To achieve this, we reconsider the MLP as a bundle of sub-MLPs and cluster them into a given number of centroids, which can then be restored as a compressed MLP that, surprisingly, is shown to approximate the NTK of the original PLM well. Extensive experiments on PLM fine-tuning for both natural language understanding (NLU) and natural language generation (NLG) tasks are provided to verify the effectiveness of the proposed method, MLP fusion. Our code is available at https://github.com/weitianxin/MLP_Fusion.
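The fusion step described above (viewing an MLP as a bundle of sub-MLPs, clustering them, and restoring a compressed MLP from the centroids) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; see the linked repository for that. The sketch assumes a two-layer MLP of the form `W2 @ gelu(W1 @ x + b1) + b2`, treats each hidden unit (its input weights, bias, and output weights) as one sub-MLP, uses a plain k-means with farthest-point initialization, and rescales each centroid's activation by its cluster size so that a centroid stands in for all of its members. The function names `fuse_mlp`, `kmeans`, and `mlp_forward`, the GELU activation, and the cluster-size rescaling are all assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_forward(x, W1, b1, W2, b2, scale=None):
    # x: (n, d_in), W1: (hidden, d_in), b1: (hidden,), W2: (d_out, hidden), b2: (d_out,)
    h = gelu(x @ W1.T + b1)
    if scale is not None:
        h = h * scale  # per-unit weight, here the cluster size of each centroid
    return h @ W2.T + b2

def kmeans(points, k, n_iter=50):
    # Deterministic farthest-point initialization, then Lloyd iterations.
    centroids = [points[0]]
    d2 = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        idx = int(np.argmax(d2))
        centroids.append(points[idx])
        d2 = np.minimum(d2, ((points - points[idx]) ** 2).sum(axis=1))
    centroids = np.stack(centroids)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Final assignment so that labels match the returned centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

def fuse_mlp(W1, b1, W2, k):
    # Each row describes one sub-MLP: [input weights | bias | output weights].
    d_in = W1.shape[1]
    sub_mlps = np.concatenate([W1, b1[:, None], W2.T], axis=1)
    centroids, labels = kmeans(sub_mlps, k)
    counts = np.bincount(labels, minlength=k).astype(W1.dtype)
    W1_c = centroids[:, :d_in]
    b1_c = centroids[:, d_in]
    W2_c = centroids[:, d_in + 1:].T
    return W1_c, b1_c, W2_c, counts

# Toy example: compress a 32-unit MLP to 8 centroids and compare outputs.
rng = np.random.default_rng(0)
d_model, hidden, k = 8, 32, 8
W1 = rng.normal(size=(hidden, d_model))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(d_model, hidden))
b2 = rng.normal(size=d_model)
x = rng.normal(size=(5, d_model))

y_full = mlp_forward(x, W1, b1, W2, b2)
W1_c, b1_c, W2_c, counts = fuse_mlp(W1, b1, W2, k)
y_fused = mlp_forward(x, W1_c, b1_c, W2_c, b2, scale=counts)
print("relative output error:", np.linalg.norm(y_full - y_fused) / np.linalg.norm(y_full))
```

On random weights, as in this toy example, the approximation error can be large because the hidden units have no cluster structure. When the hidden units are exact duplicates of a few distinct units, the fused MLP reproduces the original output up to floating-point error. The sketch only compares outputs and does not measure how well the NTK is approximated.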