Overparameterized models may have many interpolating solutions; implicit regularization refers to the hidden preference of a particular optimization method for one interpolating solution among the many. A well-established line of work has shown that (stochastic) gradient descent tends to have an implicit bias towards low-rank and/or sparse solutions when used to train deep linear networks, explaining to some extent why overparameterized neural network models trained by gradient descent tend to generalize well in practice. However, existing theory for square-loss objectives often requires very small initialization of the trainable weights, which is at odds with the larger scale at which weights are initialized in practice for faster convergence and better generalization performance. In this paper, we aim to close this gap by incorporating and analyzing gradient descent with weight normalization, where the weight vector is reparameterized in terms of polar coordinates and gradient descent is applied to the polar coordinates. By analyzing key invariants of the gradient flow and using Łojasiewicz's theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that, in contrast to plain gradient descent, weight normalization enables a robust bias that persists even if the weights are initialized at a practically large scale. Experiments suggest that weight normalization dramatically improves both the convergence speed and the robustness of the implicit bias in overparameterized diagonal linear network models.
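The polar-coordinate reparameterization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic weight-normalization form w = g · v / ‖v‖ applied to an overparameterized least-squares problem, with plain gradient descent on the magnitude `g` and the direction vector `v`. This is not the paper's diagonal linear model or its exact algorithm; the function names, problem sizes, step size, and initialization below are all hypothetical choices made for illustration.

```python
import numpy as np


def loss_and_grad_w(X, y, w):
    """Square loss 0.5 * mean((Xw - y)^2) and its gradient with respect to w."""
    r = X @ w - y
    n = X.shape[0]
    return 0.5 * (r @ r) / n, X.T @ r / n


def wn_gradients(X, y, g, v):
    """Loss and gradients in the polar coordinates (g, v), where w = g * v / ||v||."""
    norm_v = np.linalg.norm(v)
    u = v / norm_v                      # unit direction
    loss, grad_w = loss_and_grad_w(X, y, g * u)
    grad_g = u @ grad_w                 # chain rule through the magnitude
    # Chain rule through the direction: project grad_w onto the complement of u.
    grad_v = (g / norm_v) * (grad_w - (u @ grad_w) * u)
    return loss, grad_g, grad_v


def train_wn(X, y, g0, v0, lr=0.02, steps=2000):
    """Gradient descent on (g, v); returns the final weights and the loss history."""
    g, v = float(g0), np.array(v0, dtype=float)
    losses = []
    for _ in range(steps):
        loss, grad_g, grad_v = wn_gradients(X, y, g, v)
        losses.append(loss)
        g -= lr * grad_g
        v -= lr * grad_v
    return g * v / np.linalg.norm(v), losses


# Hypothetical overparameterized setup: 20 measurements, 50 unknowns, sparse target.
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = 1.0
y = X @ w_true

v0 = rng.standard_normal(d)
w_hat, losses = train_wn(X, y, g0=1.0, v0=v0 / np.linalg.norm(v0))
print("initial loss:", losses[0], "final loss:", losses[-1])
```

In this parameterization the gradient with respect to `v` is always orthogonal to `v`, so each discrete step can only increase ‖v‖. This sketch makes no claim about recovering the sparse target; it only shows the mechanics of descending in polar coordinates.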