Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, however, fine-tuning all of their parameters requires a prohibitively large amount of GPU memory. Existing approaches use zeroth-order optimizers to conserve GPU memory, which can compromise LM performance, since non-zeroth-order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose HiFT, a novel optimizer-independent, end-to-end hierarchical fine-tuning strategy that updates only a subset of parameters at each training step. HiFT significantly reduces the number of gradients and optimizer-state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves performance comparable to parameter-efficient fine-tuning and standard full-parameter fine-tuning; (2) HiFT supports various optimizers, including AdamW, AdaGrad, and SGD; (3) HiFT saves more than 60% of GPU memory compared with standard full-parameter fine-tuning of a 7B model; (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48 GB A6000 GPU at 32-bit precision with the AdamW optimizer, without using any memory-saving techniques.
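To give a rough sense of why updating only a subset of parameters per step lowers memory, the sketch below estimates the resident training state for a 7B model in 32-bit precision with AdamW. It is a back-of-the-envelope illustration, not the HiFT implementation. The function names, the choice of 8 groups, and the contiguous round-robin schedule are all hypothetical. The estimate assumes that only the active group's gradients and optimizer states occupy GPU memory, and it ignores activations and framework overhead, so it does not reproduce the measurements reported here.

```python
from itertools import cycle

BYTES_FP32 = 4  # 32-bit precision


def training_state_gb(n_params, trainable_fraction=1.0):
    """Rough GPU memory in GB (1e9 bytes) for weights, gradients, and AdamW states.

    Assumption: activations, buffers, and framework overhead are ignored.
    """
    weights = n_params * BYTES_FP32            # all weights stay resident
    trainable = n_params * trainable_fraction  # parameters updated this step
    grads = trainable * BYTES_FP32             # gradients for the active subset only
    adam_states = trainable * 2 * BYTES_FP32   # AdamW keeps two moments per parameter
    return (weights + grads + adam_states) / 1e9


def group_schedule(layer_names, n_groups):
    """Hypothetical schedule: split layers into contiguous groups, one group per step."""
    size = -(-len(layer_names) // n_groups)    # ceiling division
    groups = [layer_names[i:i + size] for i in range(0, len(layer_names), size)]
    return cycle(groups)


full = training_state_gb(7e9)            # 112.0 GB: every parameter is trainable
partial = training_state_gb(7e9, 1 / 8)  # 38.5 GB: one of 8 equal groups is trainable
print(f"full: {full} GB, one of 8 groups: {partial} GB")
```

Under these assumptions the weights alone account for 28 GB, and the remaining cost scales with the fraction of parameters that is active in a given step, which is where a group-wise strategy recovers memory.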