The remarkable generalization properties of overparameterized networks are often attributed to implicit biases, such as norm minimization at small learning rates and low sharpness in the Edge-of-Stability regime. In this work, we argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We empirically demonstrate that the learning rate interpolates between low parameter norm and low sharpness of the trained model. We furthermore prove that neither implicit bias alone minimizes the generalization error for diagonal linear networks trained on a simple regression task. These findings demonstrate that focusing on a single implicit bias is insufficient to explain good generalization, and they motivate a broader view of implicit regularization that captures the dynamic trade-off between norm and sharpness induced by non-negligible learning rates.
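The two quantities the abstract contrasts, parameter norm and sharpness, can be measured directly on a small diagonal linear network. The following is a minimal sketch under stated assumptions, not the authors' experimental setup. It assumes the parameterization beta = u**2 - v**2, a noiseless sparse regression task, the l2 norm of the stacked parameters (u, v) as "parameter norm", and the largest Hessian eigenvalue of the training loss as "sharpness". All function names (`make_data`, `loss_and_grad`, `hessian`, `train`, `summarize`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def make_data(n=20, d=40, sparsity=3, seed=0):
    """Noiseless sparse regression data (illustrative choice, not from the source)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta_star = np.zeros(d)
    beta_star[:sparsity] = 1.0
    return X, X @ beta_star


def loss_and_grad(u, v, X, y):
    """Squared loss of the assumed network beta = u**2 - v**2, with gradients in u and v."""
    n = X.shape[0]
    residual = X @ (u**2 - v**2) - y
    g_beta = X.T @ residual / n  # gradient with respect to beta
    loss = 0.5 * np.mean(residual**2)
    return loss, 2.0 * g_beta * u, -2.0 * g_beta * v


def hessian(u, v, X, y):
    """Exact Hessian of the loss with respect to the stacked parameters (u, v)."""
    n = X.shape[0]
    A = X.T @ X / n
    g_beta = X.T @ (X @ (u**2 - v**2) - y) / n
    H_uu = 4.0 * np.outer(u, u) * A + 2.0 * np.diag(g_beta)
    H_vv = 4.0 * np.outer(v, v) * A - 2.0 * np.diag(g_beta)
    H_uv = -4.0 * np.outer(u, v) * A
    return np.block([[H_uu, H_uv], [H_uv.T, H_vv]])


def train(lr, steps, X, y, init_scale=0.1):
    """Plain full-batch gradient descent from a constant initialization."""
    d = X.shape[1]
    u = np.full(d, init_scale)
    v = np.full(d, init_scale)
    for _ in range(steps):
        _, grad_u, grad_v = loss_and_grad(u, v, X, y)
        u = u - lr * grad_u
        v = v - lr * grad_v
    return u, v


def summarize(u, v, X, y):
    """Training loss, l2 parameter norm, and sharpness (largest Hessian eigenvalue)."""
    loss, _, _ = loss_and_grad(u, v, X, y)
    theta_norm = np.sqrt(np.sum(u**2) + np.sum(v**2))
    sharpness = np.linalg.eigvalsh(hessian(u, v, X, y))[-1]
    return {"loss": loss, "norm": theta_norm, "sharpness": sharpness}


if __name__ == "__main__":
    X, y = make_data()
    total_time = 10.0  # lr * steps held fixed so the runs are roughly comparable
    for lr in (0.001, 0.01, 0.03):
        u, v = train(lr, round(total_time / lr), X, y)
        print(lr, summarize(u, v, X, y))
```

The learning rates in the usage loop are deliberately conservative so that the runs stay stable. Probing the Edge-of-Stability regime discussed above would require larger rates, for which this sketch has no divergence handling, and no particular trend in the printed numbers is claimed here.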