The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations such as slow convergence and stringent assumptions required for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers). Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient on a bounded region, thereby removing the other conditions commonly imposed by stochastic methods. This capability enables the proposed algorithm to tackle the challenge of data heterogeneity effectively. Moreover, the algorithmic architecture enables scalable parallel computing and supports various preconditioners, such as second-order information, the second moment, and momentum orthogonalized by Newton-Schulz iterations. Incorporating the latter two preconditioners in PISA yields two computationally efficient variants: SISA and NSISA. Comprehensive experimental evaluations on training or fine-tuning diverse deep models, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate the superior numerical performance of SISA and NSISA compared to various state-of-the-art optimizers.
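The NSISA variant mentioned above orthogonalizes the momentum via Newton-Schulz iterations. As a hedged illustration only, the classical Newton-Schulz polar iteration, which drives a suitably scaled matrix toward its nearest orthogonal factor, can be sketched in NumPy as follows (the coefficients and step count here are the textbook ones, not necessarily those used by PISA/NSISA):

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=10):
    """Approximate the orthogonal polar factor of M (shape m x n, m >= n)
    using the cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X."""
    # Scaling by the Frobenius norm puts all singular values in (0, 1],
    # which satisfies the standard convergence condition (sigma < sqrt(3)).
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        # Each step pushes every singular value of X closer to 1,
        # leaving the singular vectors unchanged.
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

# Usage: orthogonalize a tall full-rank matrix.
M = np.array([[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
Q = newton_schulz_orthogonalize(M, steps=12)
```

After enough steps, `Q.T @ Q` is close to the identity, so `Q` can serve as an orthogonalized surrogate for the momentum matrix while preserving its singular-vector directions.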