Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its variants have been the state of the art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple, scalable second-order optimizer that uses a lightweight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and of rapid changes of the Hessian along the trajectory. Sophia estimates the diagonal Hessian only every handful of iterations, so the average per-step time and memory overhead is negligible. On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time. Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the condition number of the loss.
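The update rule described in the abstract (moving average of gradients, divided by a moving average of a diagonal Hessian estimate, then clipped element-wise) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the hyperparameter values (`lr`, `beta1`, `beta2`, `rho`, `eps`, `hess_every`), the choice to seed the Hessian average with the first estimate, and the toy quadratic loss with an exactly known diagonal Hessian are all assumptions made for illustration.

```python
import numpy as np


def sophia_like_step(theta, m, h, grad, hess_diag_est=None,
                     lr=0.5, beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12):
    """One clipped, diagonally preconditioned step (illustrative sketch).

    hess_diag_est is a fresh diagonal-Hessian estimate, or None on the
    steps where the estimate is not refreshed. All defaults are assumed.
    """
    # Moving average of the gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Refresh the moving average of the Hessian estimate only when given one.
    if hess_diag_est is not None:
        if h is None:
            h = hess_diag_est.copy()  # assumption: seed with first estimate
        else:
            h = beta2 * h + (1.0 - beta2) * hess_diag_est
    # Precondition, then clip each coordinate to [-rho, rho].
    # Non-positive curvature falls back to eps, so the clip bounds the step.
    update = np.clip(m / np.maximum(h, eps), -rho, rho)
    return theta - lr * update, m, h


def run_toy_demo(steps=400, hess_every=10, lr=0.5):
    """Minimize 0.5 * sum(c * theta**2) with very uneven curvature c."""
    c = np.array([100.0, 1.0, 0.01])      # hypothetical per-coordinate curvature
    theta = np.array([5.0, -3.0, 1.0])

    def loss(x):
        return 0.5 * np.sum(c * x ** 2)

    initial_loss = loss(theta)
    m, h = np.zeros_like(theta), None
    max_step = 0.0
    for t in range(steps):
        grad = c * theta                  # exact gradient of the toy loss
        # Stand-in for a cheap estimator: here the diagonal Hessian is just c.
        est = c if t % hess_every == 0 else None
        new_theta, m, h = sophia_like_step(theta, m, h, grad, est, lr=lr)
        max_step = max(max_step, float(np.max(np.abs(new_theta - theta))))
        theta = new_theta
    return initial_loss, loss(theta), max_step


if __name__ == "__main__":
    init, final, biggest = run_toy_demo()
    print(f"initial loss {init:.4g}, final loss {final:.4g}, "
          f"largest per-coordinate step {biggest:.4g}")
```

In this toy setting the preconditioned ratio does not depend on the curvature `c`, so coordinates whose curvature differs by four orders of magnitude move at comparable rates, and the clip keeps every per-coordinate step within `lr * rho`. A real training loop would replace the exact `c` with a stochastic diagonal Hessian estimate computed from mini-batches.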