Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques through a comprehensive, first-of-its-kind benchmarking study across five LLM families (RoBERTa, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving more memory-efficient LLM fine-tuning. Code to reproduce all our experiments is available at https://github.com/ZO-Bench/ZO-LLM.
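To make the BP-free idea concrete, the following is a minimal sketch of a generic two-point randomized gradient estimator driving a plain ZO-SGD loop. It is not taken from the paper or its repository: the function names (zo_gradient_estimate, zo_sgd, quadratic_loss), the toy quadratic objective, and the hyperparameters (lr, mu, steps) are all assumptions chosen for illustration. The point it shows is that each update needs only two loss evaluations (forward passes) and no back-propagation.

```python
import random


def zo_gradient_estimate(loss, theta, mu, rng):
    """Two-point randomized gradient estimate (hypothetical helper).

    Samples a Gaussian direction u and returns
    ((loss(theta + mu*u) - loss(theta - mu*u)) / (2*mu)) * u.
    Only loss evaluations are used; no derivatives are computed.
    """
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + mu * ui for t, ui in zip(theta, u)]
    minus = [t - mu * ui for t, ui in zip(theta, u)]
    scale = (loss(plus) - loss(minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zo_sgd(loss, theta, lr=0.05, mu=1e-3, steps=500, seed=0):
    """Plain SGD in which the gradient is replaced by the ZO estimate."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = zo_gradient_estimate(loss, theta, mu, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def quadratic_loss(theta):
    """Toy objective standing in for a fine-tuning loss (assumption)."""
    target = [1.0, -2.0, 0.5]
    return sum((t - c) ** 2 for t, c in zip(theta, target))


start = [0.0, 0.0, 0.0]
final = zo_sgd(quadratic_loss, start)
print("initial loss:", quadratic_loss(start))
print("final loss:", quadratic_loss(final))
```

On a real LLM the parameter vector has billions of entries, so the estimate is far noisier than in this three-dimensional toy problem; the sketch is meant only to illustrate the mechanics of a forward-only update.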

