Prior parameter-efficient fine-tuning (PEFT) algorithms reduce the memory usage and computational cost of fine-tuning large neural network models by training only a small number of additional adapter parameters rather than the entire model. However, the reduction in computational cost due to PEFT does not necessarily translate into a reduction in training time; although the computational cost of the adapter layers is much smaller than that of the pretrained layers, these two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid this latency overhead, but during training the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers into the model. PaCA not only improves training speed by eliminating the time overhead of sequentially processing the adapter and pretrained layers, but also reduces activation memory, since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, compared to LoRA, PaCA enables training with 23% longer sequences and improves throughput by 16% on both the NVIDIA A100 GPU and the Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/paca.
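The core mechanics described above (fine-tuning only randomly selected connections of a frozen pretrained weight, so that only partial activations are cached for the backward pass) can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the dimensions, the row-wise (input-connection) selection, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: d_in input features, d_out output features, k selected connections.
d_in, d_out, batch, k = 8, 4, 16, 3
W = rng.standard_normal((d_in, d_out)) * 0.1          # "pretrained" weight (no adapter)
idx = rng.choice(d_in, size=k, replace=False)          # randomly selected partial connections

def forward(x, W, idx):
    # Single matmul: no separate adapter layer to process sequentially.
    # Only the partial activations x[:, idx] are cached for backward,
    # instead of the full activation x.
    cache = x[:, idx]
    return x @ W, cache

def backward_partial(dy, cache):
    # Gradient only for the selected rows of W: dW[idx, :] = x[:, idx]^T @ dy.
    # All other rows of W stay frozen and need no gradient.
    return cache.T @ dy

x = rng.standard_normal((batch, d_in))
y, cache = forward(x, W, idx)
dy = np.ones_like(y)                                   # dummy upstream gradient
dW_partial = backward_partial(dy, cache)

# SGD step updates only the selected connections in place.
W[idx, :] -= 0.01 * dW_partial
```

Because the trained connections live inside the pretrained weight matrix itself, the forward pass is one dense matmul during both training and inference, and the cached activation shrinks from `(batch, d_in)` to `(batch, k)`.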