Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale

Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research: improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and better insight into how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in over-parameterized regimes often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers the limitations of these conventional approaches when confronted with the anisotropy characteristic of real-world data. These breakdowns motivate the exploration of more sophisticated alternatives rooted in curvature information: second-order approximation techniques, layer-wise preconditioning, adaptive learning rates, and more. The interplay between these optimization algorithms and the broader neural network training toolkit, which includes both established and recent developments such as maximal update parametrization, learning rate schedules, and exponential moving averages, then emerges as equally essential to empirical success. To bridge the gap between theoretical understanding and practical deployment, this thesis offers practical prescriptions and implementation strategies for integrating these methods into modern deep learning workflows.
