Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, as in black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods: large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
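To make the contrast between FO and ZO stability concrete, the following is a minimal sketch of a two-point estimator and of ZO-GD on an isotropic quadratic. It is an illustration under stated assumptions, not the paper's general condition: the abstract does not specify the estimator variant or the sampling distribution, so the central-difference form, the Gaussian directions, and the names `two_point_grad`, `zo_gd`, and `isotropic_ms_factor` are all hypothetical choices made here. For the special case f(x) = 0.5 * lam * ||x||^2 in d dimensions, a direct calculation gives the one-step mean-square factor 1 - 2*eta*lam + (d + 2)*(eta*lam)^2, so the mean-square threshold is eta < 2 / (lam * (d + 2)), compared with the classical gradient-descent threshold eta < 2 / lam. In this special case lam * (d + 2) equals the Hessian trace plus twice the largest eigenvalue, which is consistent with the abstract's statement that tractable bounds involve only those two quantities.

```python
import numpy as np


def two_point_grad(f, x, u, mu=1e-3):
    """Two-point ZO gradient estimate along direction u.

    Assumption: central-difference variant; a forward-difference
    variant (f(x + mu*u) - f(x)) / mu * u is also common.
    """
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u


def zo_gd(f, x0, eta, steps, rng, mu=1e-3):
    """Plain ZO-GD with one fresh Gaussian direction per step (assumed sampling)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        u = rng.standard_normal(x.shape)
        x = x - eta * two_point_grad(f, x, u, mu)
    return x


def isotropic_ms_factor(eta, lam, d):
    """One-step factor on E||x||^2 for f(x) = 0.5*lam*||x||^2 in d dimensions.

    Derived for Gaussian directions in the limit mu -> 0, using
    E[(u.x)^2] = ||x||^2 and E[(u.x)^2 ||u||^2] = (d + 2) ||x||^2.
    """
    return 1.0 - 2.0 * eta * lam + (d + 2) * (eta * lam) ** 2


if __name__ == "__main__":
    lam, d = 1.0, 50
    f = lambda x: 0.5 * lam * float(x @ x)

    fo_limit = 2.0 / lam              # classical GD threshold: 2 / lambda_max
    zo_limit = 2.0 / (lam * (d + 2))  # isotropic-case mean-square threshold
    print(f"FO limit {fo_limit:.4f}, ZO limit (isotropic case) {zo_limit:.4f}")

    rng = np.random.default_rng(0)
    x0 = np.ones(d)
    # One step size below and one above the isotropic-case threshold.
    for eta in (0.5 * zo_limit, 1.5 * zo_limit):
        x = zo_gd(f, x0, eta, steps=200, rng=rng)
        factor = isotropic_ms_factor(eta, lam, d)
        print(f"eta={eta:.4f}  ms_factor={factor:.4f}  ||x||={np.linalg.norm(x):.3e}")
```

A single trajectory is only suggestive, since mean-square stability is a statement about the expectation over the random directions; averaging the squared norm over many independent runs gives a fairer comparison against `isotropic_ms_factor`.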