As the number of parameters in large language models grows, pre-training and fine-tuning demand ever larger amounts of GPU memory, a significant portion of which is typically consumed by the optimizer state. To address this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all of these algorithms the $\textit{effective rank of the weight updates remains low}$, which can cause a substantial loss of the information contained in the gradient. This loss can be critical, especially during the pre-training stage. In this paper, we introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. $\texttt{FRUGAL}$ uses gradient splitting to perform low-dimensional updates with advanced algorithms such as Adam, while updates along the remaining directions are executed with state-free methods such as SGD or signSGD (Bernstein et al., 2018). The framework can be combined with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for the framework when SGDM is used for the low-dimensional updates and SGD for the state-free updates. Moreover, our method consistently outperforms competing approaches across a range of fixed memory budgets, achieving state-of-the-art results on pre-training and fine-tuning tasks while balancing memory efficiency against performance.
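To make the gradient-splitting idea concrete, the following is a minimal NumPy sketch of one FRUGAL-style update step, not the authors' implementation. It assumes a GaLore-like choice of subspace (the top-$r$ left singular vectors of the gradient), applies Adam only to the $r$-dimensional projected gradient (so the optimizer state is $r \times n$ rather than $m \times n$), and updates the residual directions with plain state-free SGD; the function name `frugal_step` and all hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def frugal_step(W, G, U, adam_state, lr=1e-3, state_free_lr=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative FRUGAL-style update (sketch, not the paper's code).

    W : (m, n) weight matrix, updated in place
    U : (m, r) orthonormal basis of the low-dimensional subspace
        (e.g. top-r left singular vectors of G, as in GaLore)
    adam_state : dict with "m", "v" of shape (r, n) and step counter "t"
    """
    # Split the gradient: a low-dimensional part handled by Adam,
    # and the residual handled by a state-free method (plain SGD here).
    P = U.T @ G                    # (r, n) projected gradient
    G_res = G - U @ P              # remaining directions -> full-rank update overall

    # Adam on the r-dimensional projection only -> small optimizer state.
    adam_state["t"] += 1
    adam_state["m"] = beta1 * adam_state["m"] + (1 - beta1) * P
    adam_state["v"] = beta2 * adam_state["v"] + (1 - beta2) * P**2
    m_hat = adam_state["m"] / (1 - beta1 ** adam_state["t"])
    v_hat = adam_state["v"] / (1 - beta2 ** adam_state["t"])
    update_low = U @ (m_hat / (np.sqrt(v_hat) + eps))  # lift back to full space

    # State-free SGD step along the residual directions.
    W -= lr * update_low + state_free_lr * G_res
    return W
```

Because the residual term touches every direction outside the chosen subspace, the combined update is full-rank even though the stateful optimizer only ever sees an $r$-dimensional projection.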