When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks

In this paper, we study the gradient descent dynamics for jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function. Concretely, we consider a realizable setting where inputs are drawn i.i.d. from a Gaussian distribution and labels follow a planted linear model. This stylized framework captures salient features of end-to-end training in inverse problems and certain auto-encoder models. Despite its apparent simplicity, the dynamics remain poorly understood, in part because the loss landscape contains multiple non-strict saddle points, making it unclear why gradient descent from random initialization reliably escapes bad stationary regions. We provide a detailed characterization of the optimization landscape and prove that gradient descent from a moderately small random initialization, with both layers trained simultaneously, converges to a global minimizer at a linear rate with order-wise optimal sample complexity. Our analysis tracks the trajectory through three phases: an alignment phase, in which the hidden weights progressively align with the planted direction while the output weights maintain the correct sign pattern; a growth phase, in which the norms of both layers increase while alignment is preserved; and a local refinement phase, in which the aligned neurons converge rapidly to the planted direction. To rigorously show that gradient descent (GD) avoids non-strict saddles, we develop trajectory-level control arguments for the end-to-end dynamics. In addition, we establish novel uniform concentration results that hold along the entire trajectory and are essential for obtaining order-wise optimal sample complexity. We corroborate our theory with extensive experiments across a range of configurations.
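The setting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `train_both_layers`, the dimensions, step size, initialization scale, and the use of the empirical squared loss are all assumptions chosen only to make the setup concrete. The network is taken to be f(x) = sum_j a_j ReLU(w_j^T x) without biases, and both `W` and `a` are updated by full-batch gradient descent.

```python
import numpy as np


def train_both_layers(d=10, m=20, n=2000, lr=0.05, steps=3000,
                      init_scale=1e-2, seed=0):
    """Full-batch GD on both layers of a one-hidden-layer ReLU network.

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Planted linear model: y = <w_star, x> with Gaussian inputs.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    X = rng.standard_normal((n, d))
    y = X @ w_star

    # Small random initialization of hidden weights W and output weights a.
    W = init_scale * rng.standard_normal((m, d))
    a = init_scale * rng.standard_normal(m)

    losses = []
    for _ in range(steps):
        pre = X @ W.T                    # pre-activations, shape (n, m)
        act = np.maximum(pre, 0.0)       # ReLU activations
        resid = act @ a - y              # prediction error, shape (n,)
        losses.append(0.5 * np.mean(resid ** 2))

        # Gradients of the empirical loss (1 / 2n) * sum_i resid_i^2.
        grad_a = act.T @ resid / n
        grad_W = (resid[:, None] * (pre > 0) * a[None, :]).T @ X / n

        # Simultaneous update of both layers.
        a -= lr * grad_a
        W -= lr * grad_W

    return W, a, w_star, np.array(losses)


def signed_alignment(W, a, w_star):
    """sign(a_j) * cos(angle between w_j and w_star), one value per neuron."""
    cos = (W @ w_star) / (np.linalg.norm(W, axis=1) * np.linalg.norm(w_star))
    return np.sign(a) * cos


if __name__ == "__main__":
    W, a, w_star, losses = train_both_layers()
    print("initial loss:", losses[0], "final loss:", losses[-1])
    print("signed alignment per neuron:", np.round(signed_alignment(W, a, w_star), 3))
```

Plotting `losses` together with `signed_alignment` and the layer norms over the iterations is one way to look for the alignment, growth, and local refinement phases informally. Because a bias-free ReLU network represents a linear map only through pairs such as ReLU(z) - ReLU(-z) = z, the sketch uses more than one hidden neuron.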
