Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW while using less memory. As one might expect of the output of a random program search, Lion combines elements of several existing algorithms, including signed momentum, decoupled weight decay, and Polyak and Nesterov momentum, yet it does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer across a wide range of tasks, its theoretical basis remains uncertain, and this lack of clarity limits opportunities to improve and extend it. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through decoupled weight decay, where $\lambda$ is the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, in which the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of the general composite optimization problem $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.
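To make the bound-constraint claim concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the commonly cited form of the Lion update (an interpolated momentum passed through $\text{sign}(\cdot)$, plus decoupled weight decay), which is not spelled out in the abstract itself. The toy quadratic objective, the hyperparameter values, and the names `lion_step` and `run_lion_demo` are illustrative assumptions.

```python
import numpy as np


def lion_step(x, m, grad, lr=0.01, beta1=0.9, beta2=0.99, weight_decay=1.0):
    """One Lion update (assumed form): signed interpolated momentum plus decoupled weight decay."""
    # Interpolate between the momentum buffer and the fresh gradient.
    c = beta1 * m + (1.0 - beta1) * grad
    # Sign update, with weight decay applied directly to the parameters.
    x_new = x - lr * (np.sign(c) + weight_decay * x)
    # The momentum buffer is refreshed with a second, slower coefficient.
    m_new = beta2 * m + (1.0 - beta2) * grad
    return x_new, m_new


def run_lion_demo(steps=2000, lr=0.01, weight_decay=1.0):
    """Minimize the toy loss f(x) = 0.5 * ||x - target||^2 (hypothetical) with Lion."""
    # Two target coordinates lie outside the box [-1/lambda, 1/lambda], one lies inside.
    target = np.array([3.0, -2.0, 0.2])
    x = np.zeros_like(target)
    m = np.zeros_like(target)
    for _ in range(steps):
        grad = x - target  # gradient of the toy quadratic
        x, m = lion_step(x, m, grad, lr=lr, weight_decay=weight_decay)
    return x


if __name__ == "__main__":
    x_final = run_lion_demo()
    print("final iterate:", x_final)
    print("max |x_i|:", np.max(np.abs(x_final)), "bound 1/lambda:", 1.0)
```

In this sketch, with $\lambda = 1$, the two coordinates whose unconstrained minimizers lie outside the box settle at the boundary values $\pm 1$, while the interior coordinate hovers near its unconstrained minimizer $0.2$, chattering with an amplitude on the order of the learning rate. This matches the constrained-minimization reading of Lion described above. One way to see why the iterates never leave the box: each step maps $x$ to $(1 - \eta\lambda)\,x - \eta\,\text{sign}(c)$, so if $|x_i| \leq 1/\lambda$ and $\eta\lambda \leq 1$, the new coordinate also has magnitude at most $1/\lambda$.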