Gradient-based methods successfully train highly overparameterized models in practice, even though the associated optimization problems are markedly nonconvex. Understanding the mechanisms that make such methods effective has become a central problem in modern optimization. To investigate this question in a tractable setting, we study Deep Diagonal Linear Networks. These are multilayer architectures with a reparameterization that preserves convexity in the effective parameter, while inducing a nontrivial geometry in the optimization landscape. Under mild initialization conditions, we show that gradient flow on the layer parameters induces a mirror-flow dynamic in the effective parameter space. This structural insight yields explicit convergence guarantees, including exponential decay of the loss under a Polyak-Łojasiewicz condition, and clarifies how the parameterization and the initialization scale govern the training speed. Overall, our results demonstrate that deep diagonal overparameterizations, despite their apparent complexity, can endow standard gradient methods with well-behaved and interpretable optimization dynamics.
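As a minimal numerical sketch of the key mechanism (illustrative only, not the paper's construction), consider a depth-L diagonal network whose layers share a positive initialization, so the effective parameter is the elementwise product w = u_1 ⊙ … ⊙ u_L = u^L. For a least-squares loss, gradient flow on the layer parameters then coincides with the reparameterized flow dw/dt = -L w^(2-2/L) ∇L(w), a mirror-flow-type dynamic in w; the snippet below integrates both with forward Euler and checks that the trajectories agree up to discretization error. All symbols (alpha, L, X, y) are assumptions chosen for the demo.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup).
# Depth-L diagonal linear network, identical positive initialization u_1 = ... = u_L,
# effective parameter w = u_1 * ... * u_L = u**L, least-squares loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 20, 5, 3                           # samples, dimension, depth
X = rng.standard_normal((n, d))
w_star = rng.uniform(0.5, 1.5, d)            # positive target, reachable with w = u**L > 0
y = X @ w_star

def grad(w):
    """Gradient of the least-squares loss 0.5 * ||X w - y||^2."""
    return X.T @ (X @ w - y)

def loss(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

alpha, dt, steps = 0.5, 1e-3, 20_000

# (1) Forward-Euler gradient flow on the layer parameters u_1, ..., u_L.
U = np.full((L, d), alpha)
for _ in range(steps):
    w = U.prod(axis=0)
    g = grad(w)
    U -= dt * (w / U) * g                    # du_k/dt = -(prod_{j != k} u_j) * grad(w)

# (2) Induced flow on the effective parameter: dw/dt = -L * w**(2 - 2/L) * grad(w).
w_eff = np.full(d, alpha ** L)
for _ in range(steps):
    w_eff -= dt * L * w_eff ** (2 - 2.0 / L) * grad(w_eff)

print("final loss (layer flow):", loss(U.prod(axis=0)))
print("max |w_layers - w_eff| :", np.abs(U.prod(axis=0) - w_eff).max())
```

The positive, symmetric initialization is only a convenience that keeps the effective-parameter ODE in closed form; the agreement of the two trajectories is the point being illustrated.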