Parameter-efficient fine-tuning (PEFT) of pre-trained language models has recently matched the performance of full fine-tuning while training significantly fewer parameters, thereby easing storage and communication constraints. Nonetheless, existing PEFT methods are limited by their inherent characteristics. Sparse fine-tuning modifies only a small subset of the existing parameters, but the selection of those parameters is task- and domain-specific, which makes it unsuitable for federated learning. PEFT methods that add new parameters, on the other hand, typically introduce additional inference latency. In this paper, we demonstrate that a sparse mask can be generated in a task-agnostic manner, so that all downstream tasks share a common mask. Our approach, which relies solely on the magnitude information of the pre-trained parameters, surpasses existing methods by a significant margin on the GLUE benchmark. Additionally, we introduce a novel adapter technique that applies the adapter directly to the pre-trained parameters instead of the hidden representation, thereby achieving the same inference speed as full fine-tuning. In extensive experiments, our proposed method attains a new state of the art in both performance and storage efficiency, storing only 0.03% of the parameters of full fine-tuning.
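The idea of a task-agnostic, magnitude-based mask can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `magnitude_mask` and `apply_sparse_update`, the flat list of weights, and the `pick_smallest` flag are all hypothetical, and the abstract does not state which end of the magnitude range is selected, so the sketch leaves that as a parameter. The point it shows is that the mask is computed from the pre-trained weights alone, so every downstream task can reuse it.

```python
def magnitude_mask(weights, fraction, pick_smallest=True):
    """Select `fraction` of the entries by absolute magnitude.

    Assumption: whether the smallest or largest magnitudes are chosen is
    not given in the abstract, so it is exposed as a flag here.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    k = int(round(fraction * len(weights)))
    # Rank indices by |w|; the mask depends only on pre-trained values.
    order = sorted(
        range(len(weights)),
        key=lambda i: abs(weights[i]),
        reverse=not pick_smallest,
    )
    chosen = set(order[:k])
    return [i in chosen for i in range(len(weights))]


def apply_sparse_update(weights, mask, deltas):
    """Merge task-specific deltas into the masked positions only."""
    return [w + d if m else w for w, m, d in zip(weights, mask, deltas)]


# Toy pre-trained weights (illustrative values, not real model parameters).
pretrained = [0.8, -0.05, 0.3, 0.01, -1.2, 0.1]

# One mask, computed once, shared by every task.
mask = magnitude_mask(pretrained, fraction=0.5)

# Two hypothetical tasks update the same positions with different deltas.
task_a = apply_sparse_update(pretrained, mask, [1.0] * len(pretrained))
task_b = apply_sparse_update(pretrained, mask, [-1.0] * len(pretrained))
```

Under this sketch, only the deltas at the masked positions need to be stored per task, and the merged weights have the same shape as the originals, so inference cost is unchanged.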