In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples. Quite surprisingly, training such networks via (stochastic) gradient descent leads to models that generalize very well, whereas classical statistics would suggest overfitting. To gain understanding of this implicit bias phenomenon, we study the special case of sparse recovery (compressed sensing), which is of interest in its own right. More precisely, in order to reconstruct a vector from underdetermined linear measurements, we introduce a corresponding overparameterized square loss functional in which the vector to be reconstructed is deeply factorized into several vectors. We show that, if an exact solution exists, vanilla gradient flow for the overparameterized loss functional converges to a good approximation of the solution of minimal $\ell_1$-norm. Minimizing the $\ell_1$-norm is well known to promote sparse solutions. As a by-product, our results significantly improve the sample complexity for compressed sensing via gradient flow/descent on overparameterized models derived in previous works. The theory accurately predicts the recovery rate in numerical experiments. Our proof relies on analyzing a certain Bregman divergence of the flow. This bypasses the obstacles caused by non-convexity and should be of independent interest.
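The setup described above can be illustrated with a small numerical sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state. The signed vector is parameterized as `u**depth - v**depth`, which is what a product of `depth` identically initialized factors collapses to, with a second branch added to allow negative entries. Gradient flow is replaced by gradient descent with a small step size. The names `make_problem` and `recover`, the problem sizes, and all default parameters are hypothetical.

```python
import numpy as np


def make_problem(m=60, n=100, seed=0):
    """Hypothetical test problem: y = A @ x_true with a 3-sparse x_true."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian measurements, m < n
    x_true = np.zeros(n)
    support = rng.choice(n, size=3, replace=False)
    x_true[support] = [1.0, -1.5, 2.0]
    return A, A @ x_true, x_true


def recover(A, y, depth=2, alpha=1e-4, lr=0.02, steps=5000):
    """Gradient descent on 0.5 * ||A @ (u**depth - v**depth) - y||^2.

    Assumption: all factors start at the same small value alpha, so the
    deep factorization reduces to entrywise powers of u and v.
    """
    n = A.shape[1]
    u = np.full(n, alpha)
    v = np.full(n, alpha)
    for _ in range(steps):
        x = u**depth - v**depth
        g = A.T @ (A @ x - y)  # gradient of the square loss with respect to x
        # Chain rule through the entrywise powers; both updates use the same g.
        u, v = (u - lr * depth * u**(depth - 1) * g,
                v + lr * depth * v**(depth - 1) * g)
    return u**depth - v**depth


if __name__ == "__main__":
    A, y, x_true = make_problem()
    x_hat = recover(A, y)
    rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(f"relative reconstruction error: {rel_err:.2e}")
```

No sparsity penalty appears anywhere in the loss. In this sketch, any preference for a sparse solution comes only from the small initialization `alpha` and the factorized parameterization, which is the implicit bias effect the abstract refers to. Larger values of `depth` generally need more iterations, because the factors leave the small initialization more slowly.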