We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$ ($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$). It is more memory-efficient than Adam as it only keeps track of the momentum. Unlike adaptive optimizers, its update, computed through the sign operation, has the same magnitude for every parameter. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% $\textit{zero-shot}$ and 91.1% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive language modeling, masked language modeling, and fine-tuning, Lion performs on par with or better than Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. The implementation of Lion is publicly available.
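The two properties claimed above (a single momentum buffer, and a sign-based update of uniform magnitude) can be illustrated with a minimal sketch. The abstract does not spell out the update rule; the form below, with two interpolation coefficients `beta1` and `beta2` and decoupled weight decay, follows the commonly cited description of Lion and should be read as an assumption. The function name `lion_step`, the list-of-floats representation, and the default values are illustrative only and are not the released implementation.

```python
from typing import List, Tuple


def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 1e-4,
    beta1: float = 0.9,         # assumed default, not stated in the abstract
    beta2: float = 0.99,        # assumed default, not stated in the abstract
    weight_decay: float = 0.0,  # decoupled weight decay (assumption)
) -> Tuple[List[float], List[float]]:
    """One Lion-style update; returns (new_params, new_momentum).

    The momentum list is the only optimizer state, whereas Adam
    keeps two buffers (first and second moment) per parameter.
    """
    new_params: List[float] = []
    new_momentum: List[float] = []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient, then keep only the sign,
        # so every nonzero update has magnitude exactly lr.
        direction = sign(beta1 * m + (1.0 - beta1) * g)
        new_params.append(p - lr * (direction + weight_decay * p))
        # The stored momentum is refreshed with a separate coefficient.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum
```

Because each nonzero coordinate of the update has magnitude `lr` regardless of gradient scale, the update norm is typically larger than Adam's for the same learning rate, which is consistent with the abstract's remark that Lion needs a smaller learning rate.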