Normalization layers are critical components of modern AI systems such as ChatGPT, Gemini, and DeepSeek. Empirically, they are known to stabilize training dynamics and improve generalization. However, the theoretical mechanism by which normalization layers contribute to both optimization and generalization remains largely unexplained, especially when many normalization layers are used in a deep neural network (DNN). In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit Lipschitz constants that are exponentially large with respect to either its parameters or its inputs, implying excessive functional capacity and potential overfitting; moreover, uncountably many such pathological DNNs exist. In contrast, inserting normalization layers provably reduces the Lipschitz constant at a rate exponential in the number of normalization layers. This exponential reduction yields two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby strengthening generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.
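As a minimal numerical sketch of the abstract's central claim (this example is not from the paper; the network architecture, the RMSNorm-style rescaling, and all names such as `forward` and `sensitivity` are illustrative assumptions), one can compare a finite-difference estimate of the local input-Lipschitz constant for a deep ReLU network with and without a per-layer normalization step. With weights scaled so each layer expands distances, the unnormalized sensitivity grows rapidly with depth, while the normalized variant stays moderate:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64
# Random weights scaled so each linear layer tends to expand distances
# (per-layer Lipschitz factor > 1), mimicking an "unnormalized bad DNN".
Ws = [2.0 * rng.standard_normal((width, width)) / np.sqrt(width)
      for _ in range(depth)]

def forward(x, normalize):
    for W in Ws:
        x = np.maximum(W @ x, 0.0)  # ReLU layer
        if normalize:
            # RMSNorm-like rescaling: project activations back to a fixed norm.
            x = x * np.sqrt(width) / (np.linalg.norm(x) + 1e-8)
    return x

x = rng.standard_normal(width)
dx = 1e-4 * rng.standard_normal(width)

def sensitivity(normalize):
    # Finite-difference estimate of the local input-Lipschitz constant.
    diff = forward(x + dx, normalize) - forward(x, normalize)
    return np.linalg.norm(diff) / np.linalg.norm(dx)

sens_unnorm = sensitivity(False)
sens_norm = sensitivity(True)
print(f"unnormalized sensitivity: {sens_unnorm:.2e}")
print(f"normalized sensitivity:   {sens_norm:.2e}")
```

This is only a one-point empirical probe, not a proof: the paper's result concerns worst-case Lipschitz constants, whereas the sketch measures sensitivity along a single random perturbation. Still, the gap between the two printed values illustrates the exponential-in-depth contrast the abstract describes.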