Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Finetuning large language models (LLMs) in federated learning (FL) settings has become important because it lets resource-constrained devices finetune a model on private data. However, finetuning LLMs with backpropagation requires more memory than such devices can spare, largely because of intermediate activations. While Forward-mode Auto-Differentiation (AD) can reduce the memory footprint of activations, we observe that applying it directly to LLM finetuning results in slow convergence and poor accuracy. This work introduces Spry, an FL algorithm that splits the trainable weights of an LLM among participating clients, so that each client uses Forward-mode AD to compute gradients that are closer estimates of the true gradients. Spry achieves a low memory footprint, high accuracy, and fast convergence. We show theoretically that the global gradients in Spry are unbiased estimates of the true global gradients when data distributions are homogeneous across clients, while heterogeneity increases the bias of the estimates. We also derive Spry's convergence rate, showing that the gradients decrease in inverse proportion to the number of FL rounds, which indicates convergence up to the limits imposed by heterogeneity. Empirically, Spry reduces the memory footprint during training by 1.4-7.1$\times$ compared to backpropagation while reaching comparable accuracy across a wide range of language tasks, models, and FL settings. Compared to state-of-the-art zero-order methods, Spry reduces the convergence time by 1.2-20.3$\times$ and achieves 5.2-13.5\% higher accuracy. When finetuning Llama2-7B with LoRA, Spry consumes only 6.2GB of peak memory, compared to 33.9GB for backpropagation. For OPT13B, the reduction is from 76.5GB to 10.8GB. Spry makes previously impossible FL deployments feasible on commodity mobile and edge devices. Source code is available at https://github.com/Astuary/Spry.
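The idea of splitting trainable weights among clients and estimating gradients with Forward-mode AD can be illustrated with a toy sketch. The code below is not the Spry implementation; it assumes a four-weight linear model with a squared-error loss, a hand-written dual-number class standing in for a real AD library, and clients that all see the same data point (the homogeneous case). All names (`Dual`, `jvp`, `client_estimate`, `spry_round`) are hypothetical. Each client perturbs only its assigned weights with a random direction, obtains the directional derivative in one forward pass, and scales the direction by that derivative. The server then assembles the per-client slices.

```python
import random


class Dual:
    """A value paired with its directional derivative (tangent)."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual._lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __mul__(self, other):
        o = Dual._lift(other)
        # Product rule applied to the tangent component.
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

    __rmul__ = __mul__


def loss(w, x, y):
    """Squared error of a linear model; accepts floats or Dual numbers."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return err * err


def jvp(w, v, x, y):
    """Directional derivative of the loss along v, from one forward pass."""
    return loss([Dual(wi, vi) for wi, vi in zip(w, v)], x, y).tan


def client_estimate(w, x, y, assigned, rng):
    """Forward-gradient estimate restricted to this client's assigned weights."""
    v = [rng.gauss(0.0, 1.0) if i in assigned else 0.0 for i in range(len(w))]
    d = jvp(w, v, x, y)
    return {i: d * v[i] for i in assigned}


def spry_round(w, x, y, assignments, rng):
    """Server side: assemble the slices returned by the clients."""
    grad = [0.0] * len(w)
    for assigned in assignments:
        for i, g in client_estimate(w, x, y, assigned, rng).items():
            grad[i] = g
    return grad


# Toy setup (assumed values, for illustration only).
w, x, y = [0.5, -0.3, 0.8, 0.1], [1.0, 2.0, -1.0, 0.5], 1.0
assignments = [{0, 1}, {2, 3}]  # two clients, two weights each
rng = random.Random(0)

# Exact gradient via one forward pass per basis vector, for reference.
exact = [jvp(w, [float(i == j) for j in range(4)], x, y) for i in range(4)]

# Averaging many rounds should approach the exact gradient.
rounds = 20000
avg = [0.0] * 4
for _ in range(rounds):
    for i, g in enumerate(spry_round(w, x, y, assignments, rng)):
        avg[i] += g / rounds

print("exact:", exact)
print("averaged estimate:", avg)
```

A single round is noisy; only the average over many rounds is expected to approach the exact gradient, which matches the unbiasedness claim for homogeneous data. In this sketch no backward pass is run and no activations are stored, because the tangent travels alongside the value during the forward pass.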
