Overparameterized models may have many interpolating solutions; implicit regularization refers to the hidden preference of a particular optimization method for one interpolating solution among the many. A now well-established line of work has shown that (stochastic) gradient descent tends to have an implicit bias towards low-rank and/or sparse solutions when used to train deep linear networks, which explains to some extent why overparameterized neural network models trained by gradient descent tend to generalize well in practice. However, existing theory for square-loss objectives often requires very small initialization of the trainable weights, which is at odds with the larger scale at which weights are initialized in practice for faster convergence and better generalization performance. In this paper, we aim to close this gap by incorporating and analyzing gradient descent with weight normalization, where the weight vector is reparameterized in terms of polar coordinates and gradient descent is applied to the polar coordinates. By analyzing key invariants of the gradient flow and using Lojasiewicz's Theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that, in contrast to plain gradient descent, weight normalization enables a robust bias that persists even if the weights are initialized at practically large scale. Experiments suggest that weight normalization dramatically improves both the convergence speed and the robustness of the implicit bias in overparameterized diagonal linear network models.
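To make the polar-coordinate reparameterization concrete, the following is a minimal sketch, not the paper's exact setup. It assumes the standard weight-normalization form w = g · v/‖v‖ (a scalar magnitude g and a direction vector v) applied to a plain least-squares objective, rather than the diagonal linear model analyzed in the paper. The function name `weight_norm_gd`, its parameters, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def weight_norm_gd(A, y, g0, v0, lr=0.05, steps=20000):
    """Gradient descent on (g, v) for the loss 0.5 * ||A w - y||^2,
    where w = g * v / ||v||. Illustrative sketch under assumed settings."""
    g = float(g0)
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        norm_v = np.linalg.norm(v)
        u = v / norm_v                      # unit direction
        w = g * u                           # effective weight vector
        grad_w = A.T @ (A @ w - y)          # gradient of the loss w.r.t. w
        grad_g = u @ grad_w                 # chain rule: radial component
        # chain rule: tangential component, orthogonal to v
        grad_v = (g / norm_v) * (grad_w - (u @ grad_w) * u)
        g -= lr * grad_g
        v -= lr * grad_v
    return g * v / np.linalg.norm(v)

if __name__ == "__main__":
    # Toy underdetermined problem (assumed data): 5 measurements, 20 unknowns.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20)) / np.sqrt(20)
    x_sparse = np.zeros(20)
    x_sparse[[3, 11]] = 1.0
    y = A @ x_sparse
    w_hat = weight_norm_gd(A, y, g0=1.0, v0=np.ones(20) / np.sqrt(20))
    print("residual norm:", np.linalg.norm(A @ w_hat - y))
```

Note that the update to v is always orthogonal to v, so the scale of v never influences the effective weight w; only the magnitude g and the direction v/‖v‖ do. This sketch illustrates the reparameterized updates only and should not be expected to reproduce the sparsity bias reported for the diagonal linear model.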