As model sizes grow rapidly, fine-tuning large pre-trained language models has become increasingly difficult because of their extensive memory usage. Previous work usually focuses on reducing the number of trainable parameters in the network. While model parameters do contribute to memory usage, the primary memory bottleneck during training is storing feature maps, also known as activations, which are needed for gradient calculation. Notably, neural networks are usually trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance, which requires storing only the sub-sampled activations for calculating the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance than existing ones. By replacing the linear operation in transformers with our approximated one, we can achieve up to $2.7\times$ peak memory reduction with almost no accuracy drop and enable up to $6.4\times$ larger batch sizes. Under the same hardware, WTA-CRS enables better downstream task performance by applying larger models, and/or faster training with larger batch sizes.
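The abstract does not spell out the estimator itself. As a minimal sketch of the underlying idea, the block below assumes plain column-row sampling (CRS), a classical unbiased estimator for matrix products, and is not the WTA-CRS estimator. The function name `crs_matmul`, the norm-proportional sampling probabilities, and the budget `k` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def crs_matmul(X, Y, k, rng):
    """Estimate X @ Y from k sampled column-row pairs (plain CRS, hypothetical helper).

    X has shape (n, m) and Y has shape (m, q). The estimate is unbiased:
    its expectation over the sampling equals X @ Y.
    """
    # Assumed sampling distribution: proportional to ||X[:, i]|| * ||Y[i, :]||.
    norms = np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1)
    p = norms / norms.sum()

    # Draw k column-row indices with replacement.
    idx = rng.choice(X.shape[1], size=k, replace=True, p=p)

    # Reweight each sampled pair by 1 / (k * p_i) so the expectation is exact.
    scale = 1.0 / (k * p[idx])
    return (X[:, idx] * scale) @ Y[idx, :]


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
Y = rng.standard_normal((6, 3))

# Averaging many independent estimates should approach the exact product.
mean_estimate = np.mean([crs_matmul(X, Y, 3, rng) for _ in range(5000)], axis=0)
relative_error = np.linalg.norm(mean_estimate - X @ Y) / np.linalg.norm(X @ Y)
```

In this sketch only the `k` sampled columns of `X` and their probabilities enter the product, which illustrates why an estimator of this kind needs only a sub-sampled operand once the indices are chosen. How WTA-CRS selects indices and lowers the variance relative to this baseline is not shown here.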