In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers such as SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead of back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications such as on-device training, where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a way to reduce memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Rather than restricting attention to conventional ZO-SGD, we explore a wider array of ZO optimization techniques through a comprehensive, first-of-its-kind benchmarking study across five LLM families (RoBERTa, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving more memory-efficient LLM fine-tuning. Code to reproduce all our experiments is available at https://github.com/ZO-Bench/ZO-LLM.
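To make the BP-free idea concrete, the following is a minimal sketch of a generic ZO-SGD step built on a two-point randomized gradient estimate, which uses only forward evaluations of the loss. It is an illustration under stated assumptions, not the benchmark's implementation: the function names `zo_gradient` and `zo_sgd`, the toy quadratic loss, and all hyperparameter values (`mu`, `lr`, `steps`) are hypothetical choices made for this example, and none of MeZO's memory-saving implementation details are reproduced.

```python
import numpy as np


def zo_gradient(loss_fn, params, mu=1e-3, rng=None):
    """Two-point randomized gradient estimate (hypothetical helper).

    Uses two forward evaluations of loss_fn and no back-propagation.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Random Gaussian direction with the same shape as the parameters.
    u = rng.standard_normal(params.shape)
    # Central finite difference of the loss along direction u.
    delta = loss_fn(params + mu * u) - loss_fn(params - mu * u)
    # Scale the direction by the estimated directional derivative.
    return (delta / (2.0 * mu)) * u


def zo_sgd(loss_fn, params, lr=1e-2, steps=2000, mu=1e-3, seed=0):
    """Plain ZO-SGD loop: SGD with the ZO estimate in place of the gradient."""
    rng = np.random.default_rng(seed)
    params = params.copy()
    for _ in range(steps):
        params -= lr * zo_gradient(loss_fn, params, mu, rng)
    return params


# Toy stand-in for a fine-tuning loss (an assumption for illustration only).
target = np.arange(10, dtype=float)


def quadratic_loss(w):
    return 0.5 * float(np.sum((w - target) ** 2))


w0 = np.zeros(10)
w_final = zo_sgd(quadratic_loss, w0)
print(quadratic_loss(w0), quadratic_loss(w_final))
```

Each step costs two forward passes and stores no activations for a backward pass, which is the source of the memory savings the abstract refers to. The price is a noisier gradient estimate whose variance grows with the number of parameters.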