Classical assumptions such as strong convexity and Lipschitz smoothness often fail to capture the nature of deep learning optimization problems, which are typically non-convex and non-smooth, leaving traditional analyses inapplicable. This study aims to elucidate the mechanisms of non-convex optimization in deep learning by extending the conventional notions of strong convexity and Lipschitz smoothness. Using these extended notions, we prove that, under the stated constraints, the empirical risk minimization problem is equivalent to jointly optimizing the local gradient norm and the structural error, which together bound the empirical risk from above and below. Our analysis further shows that stochastic gradient descent (SGD) effectively minimizes the local gradient norm, while techniques such as skip connections, over-parameterization, and random parameter initialization help control the structural error. Finally, we validate the core conclusions of this paper through extensive experiments. Together, the theoretical analysis and experimental results offer new insights into the mechanisms of non-convex optimization in deep learning.
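As an illustrative sketch only (the abstract does not give the explicit form), the claimed equivalence can be read as a two-sided bound in which a local gradient norm term and a structural error term sandwich the empirical risk; the symbols $\hat{R}(\theta)$, $\|\nabla_{\theta}\hat{R}(\theta)\|$, $E_{\mathrm{struct}}$, and the constants $c_1, c_2$ below are hypothetical notation introduced purely for illustration:
\[
c_1\,\bigl\|\nabla_{\theta}\hat{R}(\theta)\bigr\|^{2} + E_{\mathrm{struct}}
\;\le\; \hat{R}(\theta) \;\le\;
c_2\,\bigl\|\nabla_{\theta}\hat{R}(\theta)\bigr\|^{2} + E_{\mathrm{struct}},
\qquad c_1, c_2 > 0 .
\]
Read this way, driving the local gradient norm toward zero with SGD while keeping the structural error small (e.g., via skip connections, over-parameterization, and random initialization) jointly controls the empirical risk, which is the decomposition the abstract describes.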