Large foundation models, such as large language models, have performed exceptionally well across various application scenarios. Building or fully fine-tuning such large models is usually prohibitive, due either to hardware budgets or to a lack of access to backpropagation. Zeroth-order methods offer a promising direction for tackling this challenge, as only forward passes are needed to update the model. This paper introduces an efficient Stochastic Two-Point (S2P) approach within the gradient-free regime. We present the theoretical convergence properties of S2P under general and relaxed smoothness assumptions, and the derived results help to understand and intrinsically connect two popular types of zeroth-order methods: basic random search and the stochastic three-point method. These theoretical properties also motivate a Variant of S2P (VS2P), which exploits our new convergence properties that better capture the training dynamics of deep models. Our comprehensive empirical results show that VS2P is highly effective in optimizing objectives for deep models: it outperforms or matches standard methods across various model types and scales.
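To make the forward-pass-only idea concrete, below is a minimal sketch of a generic two-point zeroth-order update on a toy objective. It is illustrative only and is not the paper's S2P algorithm, whose details the abstract does not give; the function `two_point_step` and the parameters `mu` (smoothing radius) and `lr` (step size) are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_step(f, x, mu=1e-3, lr=0.1):
    """One forward-pass-only update of parameters x on objective f.

    NOTE: a generic two-point estimator for illustration,
    not the paper's S2P algorithm.

    f  : callable mapping a parameter vector to a scalar loss
    x  : current parameter vector (np.ndarray)
    mu : smoothing radius of the finite difference
    lr : step size
    """
    u = rng.standard_normal(x.shape)  # random search direction
    u /= np.linalg.norm(u)            # normalize to a unit direction
    # Two forward evaluations along +/- u; no backpropagation is used.
    g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
    return x - lr * g * u             # step against the estimated slope

# Usage: minimize a toy quadratic using forward passes only.
f = lambda x: float(np.sum(x ** 2))
x = np.ones(10)                       # initial loss is 10.0
for _ in range(1000):
    x = two_point_step(f, x)
print(f(x))                           # loss shrinks by orders of magnitude
```

The design point is that each update costs two objective evaluations along one random direction, so no gradients or backpropagation graphs are ever materialized.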