In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples. Quite surprisingly, training such networks via (stochastic) gradient descent leads to models that generalize very well, whereas classical statistics would suggest overfitting. To better understand this implicit bias phenomenon, we study the special case of sparse recovery (compressed sensing), which is of interest in its own right. More precisely, to reconstruct a vector from underdetermined linear measurements, we introduce a corresponding overparameterized square loss functional in which the vector to be reconstructed is deeply factorized into several vectors. We show that, if an exact solution exists, vanilla gradient flow for the overparameterized loss functional converges to a good approximation of the solution of minimal $\ell_1$-norm. Minimizing the $\ell_1$-norm is well known to promote sparse solutions. As a by-product, our results significantly improve on the sample complexity for compressed sensing via gradient flow/descent on overparameterized models derived in previous works. The theory accurately predicts the recovery rate in numerical experiments. Our proof relies on analyzing a certain Bregman divergence of the flow. This bypasses the obstacles caused by non-convexity and should be of independent interest.
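To make the setup concrete, the following is a minimal sketch of gradient descent on a factorized square loss. It is only an illustration under stated assumptions, not the construction analyzed in the paper: the abstract does not fix the parameterization, so we assume $x = u^{\odot L} - v^{\odot L}$ (entrywise powers, with two factors so that negative entries can be represented), a small constant initialization `alpha`, and plain gradient descent with step size `lr` as a discretization of gradient flow. The function name `factorized_gd` and all parameter names and values are hypothetical.

```python
import numpy as np


def factorized_gd(A, y, depth=3, alpha=1e-2, lr=5e-3, steps=40000):
    """Gradient descent on 0.5 * ||A (u**depth - v**depth) - y||^2.

    Assumed parameterization (not taken from the source): x = u**depth - v**depth,
    with u and v both initialized to the constant vector alpha.
    """
    n = A.shape[1]
    u = np.full(n, alpha)
    v = np.full(n, alpha)
    for _ in range(steps):
        x = u**depth - v**depth
        # Gradient of the square loss with respect to x.
        g = A.T @ (A @ x - y)
        # Chain rule through the entrywise powers; both updates use the old u, v.
        u_next = u - lr * depth * u ** (depth - 1) * g
        v_next = v + lr * depth * v ** (depth - 1) * g
        u, v = u_next, v_next
    return u**depth - v**depth


# Small synthetic instance: m < n measurements of an s-sparse vector.
rng = np.random.default_rng(0)
m, n = 20, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 17, 29]] = [1.0, -1.0, 1.0]
y = A @ x_true

x_hat = factorized_gd(A, y)
print("residual norm:", np.linalg.norm(A @ x_hat - y))
print("l1 norm of estimate:", np.abs(x_hat).sum())
print("l1 norm of ground truth:", np.abs(x_true).sum())
```

No explicit $\ell_1$ penalty appears in the loss; any preference for small $\ell_1$-norm in this sketch would come only from the factorization together with the small initialization. Comparing the printed $\ell_1$-norms for different values of `alpha` and `depth` is one way to probe this on toy instances.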