Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models remains highly compute- and memory-intensive. In this paper, we identify and characterise the components needed for effective model convergence using gradient descent. In doing so, we find that the intermediate activations used to implement backpropagation can be heavily compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. During the forward pass, the proposed algorithm divides each token into smaller sub-tokens and projects them onto a fixed one-dimensional subspace. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm that our algorithm is complementary to many state-of-the-art parameter-efficient fine-tuning (PEFT) methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
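The compression step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names, the sub-token length, the choice of the fixed vector `v`, and the use of a plain linear layer as the consumer of the reconstructed activation are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def normalize(v):
    """Scale v to unit length so projection and reconstruction use the same basis vector."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def compress(token, v):
    """Forward pass: split the token into sub-tokens of len(v) and keep
    only the scalar projection of each sub-token onto v."""
    m = len(v)
    if len(token) % m != 0:
        raise ValueError("token length must be a multiple of the sub-token length")
    return [
        sum(a * b for a, b in zip(token[i:i + m], v))
        for i in range(0, len(token), m)
    ]


def reconstruct(coeffs, v):
    """Backward pass: coarse rank-1 reconstruction, one scaled copy of v per sub-token."""
    return [c * b for c in coeffs for b in v]


def weight_grad(grad_out, saved_input):
    """Weight gradient of a linear layer y = W x: the outer product of the
    output gradient and the (here, reconstructed) input."""
    return [[g * x for x in saved_input] for g in grad_out]


# Hypothetical example: an 8-dimensional token split into two sub-tokens of length 4.
v = normalize([1.0, 1.0, 1.0, 1.0])          # assumed fixed direction
token = [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0]

coeffs = compress(token, v)                   # 2 stored scalars instead of 8
restored = reconstruct(coeffs, v)             # coarse estimate of the token
approx_grad = weight_grad([1.0, -1.0], restored)

print(coeffs)    # [5.0, 4.0]
print(restored)  # [2.5, 2.5, 2.5, 2.5, 2.0, 2.0, 2.0, 2.0]
```

In this toy case the stored activation shrinks from eight values to two. The second sub-token is recovered exactly because it is parallel to `v`, while the first is replaced by its projection, which is the sense in which the reconstruction is coarse.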