Training large models, ranging from millions to billions of parameters, is highly resource-intensive, demanding significant time, compute, and memory. We observe that most of the learning (the largest changes in weights) takes place in the early stages of training. As training progresses, these changes stabilize, suggesting that the remaining updates can be approximated by matrices of low intrinsic rank. We therefore propose an approach that identifies such states of partial convergence and dynamically switches from full-parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. The approach is flexible: user-defined hyperparameters determine the switching point, and each module layer is assigned a rank based on its degree of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of the original count, yielding a 3x improvement in throughput, a 1.5x reduction in average training time per epoch, and a 20% reduction in GPU memory consumption.
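The switching logic described above can be illustrated with a minimal sketch. The function below is a hypothetical implementation, not the paper's actual code: it compares per-layer weights between two checkpoints, treats a layer whose relative weight change falls below a threshold `tau` as partially converged, and assigns it a LoRA rank scaled by how much it is still moving (clamped to an assumed range `[r_min, r_max]`); all threshold and rank values here are illustrative assumptions.

```python
import math

def frob_norm(mat):
    """Frobenius norm of a matrix stored as a list of row lists."""
    return math.sqrt(sum(x * x for row in mat for x in row))

def lora_rank_schedule(prev_weights, curr_weights, tau=0.01, r_max=64, r_min=4):
    """Decide, per layer, whether to switch to LoRA and at what rank.

    prev_weights / curr_weights: {layer_name: matrix} at two checkpoints.
    Returns {layer_name: rank} for layers considered partially converged,
    or {layer_name: None} for layers that should stay in full training.
    (tau, r_max, r_min are illustrative hyperparameters, not the paper's.)
    """
    ranks = {}
    for name, w_prev in prev_weights.items():
        w_curr = curr_weights[name]
        diff = [[c - p for c, p in zip(row_c, row_p)]
                for row_c, row_p in zip(w_curr, w_prev)]
        rel_change = frob_norm(diff) / (frob_norm(w_prev) + 1e-12)
        if rel_change < tau:
            # More residual movement -> higher rank, clamped to [r_min, r_max].
            ranks[name] = max(r_min, min(r_max, round(r_max * rel_change / tau)))
        else:
            ranks[name] = None  # not yet converged: keep full-parameter training
    return ranks

# Toy example: one layer has nearly stopped moving, one is still changing.
w0 = {"attn.qkv": [[1.0, 2.0], [3.0, 4.0]],
      "mlp.fc1": [[1.0, 1.0], [1.0, 1.0]]}
w1 = {"attn.qkv": [[1.0001, 2.0], [3.0, 4.0]],   # tiny update
      "mlp.fc1": [[2.0, 0.0], [0.0, 2.0]]}       # large update
schedule = lora_rank_schedule(w0, w1)
print(schedule)  # attn.qkv gets a small LoRA rank; mlp.fc1 stays in full training
```

In a real training loop this check would run once per evaluation interval, and converged layers would have their base weights frozen and wrapped with LoRA adapters of the returned rank.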