Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically incurs substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules formed by the undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be orthogonally integrated with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP can reduce training time by 44% with accuracy comparable to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a 6.2x larger sequence length on a single NVIDIA A100 GPU. Furthermore, DropBP enables a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/dropbp.
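The core mechanism can be illustrated with a minimal sketch: a residual block whose forward pass always runs in full, but whose backward pass is skipped with some probability, so gradients flow only through the residual (identity) shortcut. This is a hypothetical NumPy illustration of the idea, not the authors' implementation; the class name `ResidualLinear` and the manual backward interface are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

class ResidualLinear:
    """Residual block y = x + x @ W with a DropBP-style backward pass.

    Illustrative sketch only: forward propagation is always computed in
    full, but with probability `drop_rate` the backward pass skips this
    layer's gradient computation, propagating the incoming gradient
    only along the residual connection.
    """

    def __init__(self, dim, drop_rate):
        self.W = rng.standard_normal((dim, dim)) * 0.01
        self.drop_rate = drop_rate

    def forward(self, x):
        self.x = x                      # cache activation for backward
        return x + x @ self.W

    def backward(self, grad_out):
        if rng.random() < self.drop_rate:
            # Layer dropped during backward: no weight gradient and no
            # matrix multiplications; the gradient flows unchanged
            # through the identity shortcut.
            self.grad_W = np.zeros_like(self.W)
            return grad_out
        # Normal backward pass through both the layer and the shortcut.
        self.grad_W = self.x.T @ grad_out
        return grad_out + grad_out @ self.W.T
```

In the paper, the per-layer drop rates are not uniform: each layer's sensitivity is estimated and less sensitive layers receive higher drop rates, which is what stabilizes training; in this sketch that would amount to passing a different `drop_rate` to each block.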