Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations, but existing methods rely on one-step gradient estimates from random perturbations. We introduce Bayesian Subspace Zeroth-Order optimization (BSZO), a ZO optimizer that applies Kalman filtering to combine finite-difference information across multiple perturbation directions. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adjust perturbation scales. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/\gamma$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms MeZO, MeZO-Adam, and HiZOO across various tasks, achieving up to 6.67\% absolute average improvement on OPT-13B while keeping memory usage close to inference-only baselines (1.00$\times$--1.08$\times$ of MeZO).
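To make the abstract's description concrete, the following is a minimal sketch of the core idea: treat each central finite difference along a random direction as a noisy observation of the corresponding directional derivative, fuse the $k$ observations with a scalar Kalman-style Bayesian update, and use the posterior mean as the gradient estimate. All names and hyperparameters here (`loss_fn`, `k`, `eps`, `obs_var`, `prior_var`, `lr`) are illustrative assumptions, the posterior is simplified to a diagonal Gaussian, and the residual-based perturbation-scale adaptation is only indicated in a comment; this is not the paper's exact BSZO algorithm.

```python
import numpy as np

def bszo_step_sketch(loss_fn, theta, k=4, eps=1e-3, obs_var=1e-2,
                     prior_var=1.0, lr=1e-2, rng=None):
    """Illustrative ZO step: fuse k finite-difference measurements
    with a diagonal Kalman-style Bayesian update (sketch only)."""
    rng = np.random.default_rng(rng)
    d = theta.size
    U = rng.standard_normal((k, d)) / np.sqrt(d)   # random perturbation directions

    # Prior over the projected gradient in span(U): N(0, prior_var * I_k).
    mu = np.zeros(k)
    var = np.full(k, prior_var)

    for i in range(k):
        # Central finite difference = noisy observation of u_i^T grad.
        y = (loss_fn(theta + eps * U[i]) - loss_fn(theta - eps * U[i])) / (2 * eps)
        # Scalar Kalman update for coordinate i under the diagonal model.
        gain = var[i] / (var[i] + obs_var)
        residual = y - mu[i]
        mu[i] += gain * residual
        var[i] *= (1 - gain)
        # A residual-based rule could rescale eps here (e.g. shrink eps when
        # |residual| is large); omitted in this sketch.

    # Lift the posterior-mean projected gradient back to parameter space.
    grad_est = U.T @ mu
    return theta - lr * grad_est
```

As a usage note, `bszo_step_sketch` only needs forward evaluations of `loss_fn` (2k per step), which is what keeps memory close to inference-only baselines in this family of methods.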