Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands of backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of these gradient estimates typically scales linearly with the model's parameter dimension, a significant issue for LLMs. In this paper, we propose random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs' high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving training performance. Additionally, we prove that our gradient estimate closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero improves fine-tuning performance and converges faster than standard ZO approaches such as MeZO across various language modeling tasks.
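To make the mechanism described above concrete, here is a minimal NumPy sketch of a two-point zeroth-order gradient estimate whose perturbation is confined to a low-rank subspace. This is an illustrative toy on a quadratic loss, not the paper's implementation: the rank, the QR-based subspace construction, and the function names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss on a matrix parameter W (a stand-in for one weight matrix).
A = rng.standard_normal((8, 8))

def loss(W):
    return 0.5 * np.sum((W - A) ** 2)

def subspace_zo_grad(W, loss_fn, rank=2, eps=1e-3, rng=rng):
    """Two-point ZO gradient estimate with a low-rank perturbation (sketch).

    The perturbation U @ Z @ V.T lies in a rank-`rank` subspace, so only the
    small factors U, V, Z need to be drawn and stored, rather than a dense
    perturbation of the full parameter dimension.
    """
    m, n = W.shape
    # Orthonormal bases for the random subspace, via QR of Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((m, rank)))
    V, _ = np.linalg.qr(rng.standard_normal((n, rank)))
    Z = rng.standard_normal((rank, rank))
    P = U @ Z @ V.T  # low-rank perturbation direction
    # Finite-difference directional derivative from two forward passes.
    c = (loss_fn(W + eps * P) - loss_fn(W - eps * P)) / (2 * eps)
    return c * P  # gradient estimate, supported on the subspace

W = rng.standard_normal((8, 8))
g = subspace_zo_grad(W, loss)
# One SGD step along the estimate decreases the loss on this convex toy.
W_new = W - 1e-3 * g
```

Only two forward evaluations of the loss are needed per estimate; no backpropagation graph is stored, which is the memory saving the abstract refers to.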