Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, that is, delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into contributions from clean and noisy training data, the epoch-wise evolution of internal signals was analyzed separately for each subset. Three main findings emerged from this analysis. First, the models achieved strong re-generalization on test data even after perfectly fitting the noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned later than clean data, and as training progressed, their corresponding internal activations became increasingly separated in the outer layers; this separation enabled the models to overfit only the noisy data. Third, a single, very large activation emerged in a shallow layer across all models; this phenomenon, referred to as "outliers," "massive activations," or "super activations" in recent large language models, evolved alongside re-generalization. The magnitude of this large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activations," and support a novel scenario for understanding deep double descent.
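The clean/noisy loss decomposition mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual analysis code: the function name, array shapes, and toy values are all assumptions introduced here for clarity.

```python
import numpy as np

def decompose_loss(per_sample_loss, noisy_mask):
    """Split a per-sample loss vector into mean contributions from
    clean and label-noised training examples.

    per_sample_loss: array of shape (n_samples,), loss of each example
    noisy_mask: boolean array, True where the label was corrupted
    Returns (mean clean loss, mean noisy loss).
    """
    noisy_mask = np.asarray(noisy_mask, dtype=bool)
    clean_loss = float(per_sample_loss[~noisy_mask].mean())
    noisy_loss = float(per_sample_loss[noisy_mask].mean())
    return clean_loss, noisy_loss

# Toy example: 6 training samples, 2 of which carry corrupted labels.
losses = np.array([0.1, 0.2, 0.1, 2.0, 1.8, 0.2])
mask   = np.array([False, False, False, True, True, False])
clean, noisy = decompose_loss(losses, mask)  # clean = 0.15, noisy = 1.9
```

Tracking these two means over epochs yields the separate clean/noisy loss curves the study analyzes; early in training the noisy-subset loss stays high while the clean-subset loss falls, consistent with noisy data being learned later than clean data.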