Fine-tuning large language models (LLMs) with standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. In contrast, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplifies estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose \textbf{Hi-ZFO} (\textbf{Hi}erarchical \textbf{Z}eroth- and \textbf{F}irst-\textbf{O}rder optimization), a hybrid framework designed to combine the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of "beneficial stochasticity" that helps the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.
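The hybrid scheme described above can be illustrated with a minimal sketch: critical layers take an exact first-order gradient step, while the remaining layers use an SPSA-style two-point zeroth-order estimate. This is an assumption-laden toy (the function names `zo_grad` and `hybrid_step`, the NumPy parameterization, and the importance set `critical` are all hypothetical illustrations, not the paper's actual implementation):

```python
import numpy as np

def zo_grad(loss_fn, theta, eps=1e-3, rng=None):
    """SPSA-style zeroth-order gradient estimate from two loss evaluations."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(theta.shape)  # random perturbation direction
    # Central finite difference along z approximates the directional derivative
    scale = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return scale * z

def hybrid_step(layers, loss_fn, grads, critical, lr=0.1):
    """One hybrid update: FO on `critical` layers, ZO on the rest."""
    for name, theta in layers.items():
        if name in critical:
            g = grads[name]  # precise first-order gradient (e.g. from backprop)
        else:
            # Perturb only this layer's parameters when estimating its gradient
            g = zo_grad(lambda t: loss_fn({**layers, name: t}), theta)
        layers[name] = theta - lr * g
    return layers
```

In a real LLM setting the FO partition would still require backpropagation, while the ZO partition needs only forward passes, which is where the memory and stochasticity trade-off described in the abstract arises.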

