LoRA is a technique that reduces the number of trainable parameters in a neural network by adding low-rank adapters to its linear layers. It is used both for fine-tuning (LoRA, QLoRA) and for full training (ReLoRA). This paper presents RunLoRA, a framework of efficient LoRA implementations that significantly speeds up neural network training and fine-tuning with low-rank adapters. The proposed implementation optimizes the computation of LoRA operations by choosing the best forward and backward computation graph from FLOPs and time estimates, which depend on the dimensions of the corresponding linear layer, the dimensions of the layer input, and the LoRA rank. This yields faster training without sacrificing accuracy. Experimental results show a speedup of up to 17% on the Llama family of models.
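To illustrate FLOPs-based selection of a computation graph, here is a minimal sketch for the forward pass only. It is not the RunLoRA code or API. The two candidate orderings, the function names (`matmul_flops`, `forward_flops`, `pick_forward`) and the cost model of 2·m·k·n FLOPs per matrix product are assumptions made for this example. It also ignores the time estimates the abstract mentions and the memory needed for intermediate results.

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs of an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n


def forward_flops(tokens, d_in, d_out, rank):
    """Estimate FLOPs for two hypothetical forward orderings of a LoRA layer.

    Assumed shapes: x is (tokens, d_in), w is (d_in, d_out),
    a is (d_in, rank), b is (rank, d_out).
    """
    return {
        # Keep the adapter separate: y = x @ w + (x @ a) @ b
        "x@w + (x@a)@b": (
            matmul_flops(tokens, d_in, d_out)
            + matmul_flops(tokens, d_in, rank)
            + matmul_flops(tokens, rank, d_out)
        ),
        # Merge the adapter into the weight first: y = x @ (w + a @ b)
        # (the element-wise addition is ignored as a lower-order term)
        "x@(w + a@b)": (
            matmul_flops(d_in, rank, d_out)
            + matmul_flops(tokens, d_in, d_out)
        ),
    }


def pick_forward(tokens, d_in, d_out, rank):
    """Return the name of the ordering with the lowest estimated FLOPs."""
    costs = forward_flops(tokens, d_in, d_out, rank)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    for tokens in (512, 8192):
        print(tokens, pick_forward(tokens, d_in=4096, d_out=4096, rank=8))
```

Under this simplified model, the merged ordering becomes cheaper once the number of input tokens exceeds d_in·d_out / (d_in + d_out). This is why the best choice depends on both the layer dimensions and the input dimensions.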