The training process of ReLU neural networks often exhibits complicated nonlinear phenomena. The nonlinearity of the models and the non-convexity of the loss pose significant challenges for theoretical analysis. Therefore, most previous theoretical works on the optimization dynamics of neural networks focus either on local analysis (such as the end of training) or on approximate linear models (such as the Neural Tangent Kernel). In this work, we provide a complete theoretical characterization of the training process of a two-layer ReLU network trained by Gradient Flow on linearly separable data. In this specific setting, our analysis captures the whole optimization process, from random initialization to final convergence. Despite the relatively simple model and data that we study, we reveal four different phases of the training process, showing a general simplifying-to-complicating learning trend. Specific nonlinear behaviors can also be precisely identified and captured theoretically, such as initial condensation, saddle-to-plateau dynamics, plateau escape, changes of activation patterns, and learning with increasing complexity.
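For readers who want to observe this setting numerically, the following is a minimal sketch and not the construction analyzed in this work. It assumes a bias-free two-layer ReLU network with both layers trained, a logistic loss, small Gaussian initialization, and small-step gradient descent as a stand-in for Gradient Flow. The function names `make_separable_data` and `train_two_layer_relu`, and all hyperparameters, are hypothetical choices made for illustration.

```python
import numpy as np


def make_separable_data(n=40, d=2, margin=0.5, seed=0):
    """Sample Gaussian points and label them by a hypothetical direction w_star.

    Each point is then shifted along w_star by `margin` on its own side,
    so the two classes are linearly separable with a positive margin.
    """
    rng = np.random.default_rng(seed)
    w_star = np.zeros(d)
    w_star[0] = 1.0
    X = rng.normal(size=(n, d))
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    X = X + margin * y[:, None] * w_star
    return X, y


def train_two_layer_relu(X, y, m=20, init_scale=1e-3, lr=0.05, steps=4000, seed=1):
    """Train f(x) = sum_k a_k * relu(w_k . x) with small-step gradient descent.

    Assumptions (not taken from the source): logistic loss, both layers
    trained, no bias terms. A small `lr` approximates Gradient Flow.
    Returns the final weights and the loss recorded before each step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = init_scale * rng.normal(size=(m, d))  # first-layer weights, shape (m, d)
    a = init_scale * rng.normal(size=m)       # second-layer weights, shape (m,)
    losses = np.empty(steps)
    for t in range(steps):
        pre = X @ W.T                  # pre-activations, shape (n, m)
        act = pre > 0                  # activation pattern of each neuron
        hidden = pre * act             # ReLU output
        out = hidden @ a               # network output, shape (n,)
        margins = y * out
        # Mean logistic loss log(1 + exp(-margin)), computed stably.
        losses[t] = np.mean(np.logaddexp(0.0, -margins))
        # Derivative of the mean loss with respect to each output.
        g_out = -y / (1.0 + np.exp(margins)) / n
        grad_a = hidden.T @ g_out
        grad_W = ((g_out[:, None] * act) * a).T @ X
        a -= lr * grad_a
        W -= lr * grad_W
    return W, a, losses


if __name__ == "__main__":
    X, y = make_separable_data()
    W, a, losses = train_two_layer_relu(X, y)
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
```

Plotting `losses` on a logarithmic time axis, or tracking the neuron directions `W / ||W||` over the iterations, is one way to look for plateau-like behavior and direction alignment in such a toy run. Whether the phases appear, and where, depends on the assumed loss, initialization scale, and step size.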