Prior parameter-efficient fine-tuning (PEFT) algorithms reduce the memory usage and computational cost of fine-tuning large neural network models by training only a small number of additional adapter parameters rather than the entire model. However, the reduction in computational cost due to PEFT does not necessarily translate into a reduction in training time; although the computational cost of the adapter layers is much smaller than that of the pretrained layers, these two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid this latency overhead, but during training the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers into the model. PaCA not only improves training speed by eliminating the time overhead of sequentially processing the adapter and pretrained layers, but also reduces activation memory, since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, compared to LoRA, PaCA enables training with 23% longer sequences and improves throughput by 16% on both the NVIDIA A100 GPU and the Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/paca.
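The core mechanics described above (fine-tuning only randomly selected connections of a frozen pretrained weight, so that only partial activations are cached for the backward pass) can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the dimensions, the row-wise (input-connection) selection, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: d_in input features, d_out output features, k selected connections.
d_in, d_out, batch, k = 8, 4, 16, 3
W = rng.standard_normal((d_in, d_out)) * 0.1          # "pretrained" weight (no adapter)
idx = rng.choice(d_in, size=k, replace=False)          # randomly selected partial connections

def forward(x, W, idx):
    # Single matmul: no separate adapter layer to process sequentially.
    # Only the partial activations x[:, idx] are cached for backward,
    # instead of the full activation x.
    cache = x[:, idx]
    return x @ W, cache

def backward_partial(dy, cache):
    # Gradient only for the selected rows of W: dW[idx, :] = x[:, idx]^T @ dy.
    # All other rows of W stay frozen and need no gradient.
    return cache.T @ dy

x = rng.standard_normal((batch, d_in))
y, cache = forward(x, W, idx)
dy = np.ones_like(y)                                   # dummy upstream gradient
dW_partial = backward_partial(dy, cache)

# SGD step updates only the selected connections in place.
W[idx, :] -= 0.01 * dW_partial
```

Because the trained connections live inside the pretrained weight matrix itself, the forward pass is one dense matmul during both training and inference, and the cached activation shrinks from `(batch, d_in)` to `(batch, k)`.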