The study of Deep Network (DN) training dynamics has largely focused on the evolution of the loss function, evaluated on or around train and test set data points. In fact, many DN phenomena, e.g., double descent and grokking, were first introduced in the literature in terms of the loss. In this study, we look at the training dynamics of the input space partition, i.e., the linear regions, formed by continuous piecewise affine DNs, e.g., networks with (leaky) ReLU nonlinearities. First, we present a novel statistic that captures the local complexity (LC) of the DN based on the concentration of linear regions inside arbitrary-dimensional neighborhoods around data points. We observe that during training, the LC around data points undergoes a number of phases, starting with a decreasing trend after initialization, followed by an ascent, and ending with a final descending trend. Using exact visualization methods, we come across the perplexing observation that during the final LC descent phase of training, linear regions migrate away from training and test samples towards the decision boundary, making the DN input-output mapping nearly linear everywhere else. We also observe that the different LC phases are closely related to the memorization and generalization performance of the DN, especially during grokking.
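To make the notion of "concentration of linear regions around a data point" concrete, here is a minimal sketch of a Monte Carlo proxy. It is not the LC statistic proposed in the paper, whose exact definition is not given in this abstract. The sketch assumes a small ReLU MLP with random weights and counts the distinct hidden-unit activation patterns among points sampled in an L-infinity ball around an input. Each distinct pattern corresponds to a region on which the network is affine, so the count is a sampled lower bound on the number of such regions meeting the ball. The names `init_mlp`, `activation_patterns`, and `local_complexity_proxy` are hypothetical and chosen only for illustration.

```python
import numpy as np

def init_mlp(widths, rng):
    """Random parameters for a ReLU MLP; widths = [in_dim, hidden..., out_dim].
    Assumption: Gaussian weights with small Gaussian biases, for illustration only."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m),
             0.1 * rng.standard_normal(n))
            for m, n in zip(widths[:-1], widths[1:])]

def activation_patterns(params, x):
    """Return the on/off state of every hidden ReLU unit for each row of x."""
    patterns = []
    h = x
    for W, b in params[:-1]:            # hidden layers only; last layer is linear
        pre = h @ W + b
        patterns.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(patterns, axis=1)

def local_complexity_proxy(params, x, radius, n_samples, rng):
    """Count distinct activation patterns among points sampled uniformly in an
    L-infinity ball of the given radius around x (a sampled lower bound on the
    number of affine regions intersecting the ball, not the paper's statistic)."""
    noise = rng.uniform(-radius, radius, size=(n_samples, x.shape[0]))
    pats = activation_patterns(params, x[None, :] + noise)
    return len(np.unique(pats.astype(np.uint8), axis=0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_mlp([2, 16, 16, 1], rng)
    x = np.array([0.5, -0.3])
    for radius in (0.01, 0.1, 1.0):
        count = local_complexity_proxy(params, x, radius, 2000, rng)
        print(f"radius={radius}: {count} distinct patterns")
```

Tracking such a count around training and test points over the course of training is one simple way to observe whether regions concentrate near the data or move away from it, though the sampling-based count depends on `radius` and `n_samples` and becomes unreliable in high input dimensions.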