In recent years, deep learning, powered by neural networks, has achieved widespread success in solving high-dimensional problems, particularly those with low-dimensional feature structures. This success stems from the ability of neural networks to identify and learn low-dimensional features tailored to the problem at hand. Understanding how neural networks extract such features during training remains a fundamental question in deep learning theory. In this work, we propose a novel perspective by interpreting the neurons in the last hidden layer of a neural network as basis functions that represent essential features. To explore the linear independence of these basis functions throughout the training dynamics, we introduce the concept of 'effective rank'. Our extensive numerical experiments reveal a notable phenomenon: the effective rank increases progressively during the learning process, exhibiting a staircase-like pattern, while the loss function decreases as the effective rank rises. We refer to this observation as the 'staircase phenomenon'. Furthermore, for deep neural networks, we rigorously prove a negative correlation between the loss function and the effective rank, demonstrating that the lower bound of the loss function decreases as the effective rank increases. Therefore, to achieve a rapid descent of the loss function, it is critical to promote swift growth of the effective rank. Finally, we evaluate existing advanced training methodologies and find that these approaches quickly attain a higher effective rank, thereby bypassing redundant stages of the staircase process and accelerating the decline of the loss function.
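To make the notion of effective rank concrete, the sketch below computes a plausible version of it for the feature matrix whose columns are the last-hidden-layer neuron outputs evaluated on the training samples. The abstract does not state the paper's precise definition, so this assumes the numerical rank of that matrix, i.e. the number of singular values above a relative tolerance `rtol`; both the tolerance and the function name are hypothetical choices for illustration.

```python
import numpy as np

def effective_rank(features: np.ndarray, rtol: float = 1e-6) -> int:
    """Numerical rank of the last-hidden-layer feature matrix.

    features : array of shape (n_samples, n_neurons); column j holds the
               values of neuron j (viewed as a basis function) on the data.
    rtol     : singular values below rtol * (largest singular value) are
               treated as zero (an assumed tolerance, not from the paper).
    """
    s = np.linalg.svd(features, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > rtol * s[0]))

# Toy usage: three neurons, but the third is a linear combination of the
# first two, so only two basis functions are linearly independent.
x = np.linspace(-1.0, 1.0, 50)
phi = np.stack([x, x**2, 2.0 * x - 0.5 * x**2], axis=1)
print(effective_rank(phi))  # -> 2
```

Under this reading, tracking `effective_rank` of the last-hidden-layer outputs over training epochs would yield the staircase-like curve described above.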