Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.