Deep neural networks (DNNs) defy the classical bias-variance trade-off: adding parameters to a DNN that interpolates its training data will typically improve its generalization performance. Explaining the mechanism behind this ``benign overfitting'' in deep networks remains an outstanding challenge. Here, we study the last hidden layer representations of various state-of-the-art convolutional neural networks and find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise. The number of such groups increases linearly with the width of the layer, but only if the width is above a critical value. We show that redundant neurons appear only when the training process reaches interpolation and the training error is zero.
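As a toy illustration of what such redundancy looks like, the following minimal sketch builds a synthetic "layer" from several groups of neurons, each carrying the same signal plus statistically independent noise. It is not the analysis used in this work: the names `make_redundant_layer` and `average_over_groups`, the Gaussian signal and noise, and all numerical values are assumptions made for illustration. The sketch also assumes that a downstream readout can average over groups, which the text above does not state.

```python
import numpy as np


def make_redundant_layer(n_samples, group_size, n_groups, noise_std, seed=0):
    """Synthetic activations: n_groups copies of one signal, each with its own noise.

    Hypothetical helper for illustration only; the layer width is
    group_size * n_groups, so the number of groups grows linearly with width.
    """
    rng = np.random.default_rng(seed)
    # Shared information carried by every group.
    signal = rng.standard_normal((n_samples, group_size))
    # Columns are laid out as [group 0 | group 1 | ... | group n_groups-1].
    clean = np.tile(signal, (1, n_groups))
    # Independent noise for every neuron.
    noise = noise_std * rng.standard_normal(clean.shape)
    return signal, clean + noise


def average_over_groups(acts, group_size, n_groups):
    """Average corresponding neurons across the groups."""
    grouped = acts.reshape(acts.shape[0], n_groups, group_size)
    return grouped.mean(axis=1)


group_size, n_groups, noise_std = 4, 16, 0.5
signal, acts = make_redundant_layer(20000, group_size, n_groups, noise_std)

# Error of a single group versus the average over all groups.
single_err = np.var(acts[:, :group_size] - signal)
avg_err = np.var(average_over_groups(acts, group_size, n_groups) - signal)

print(f"noise variance, one group:       {single_err:.4f}")
print(f"noise variance, group average:   {avg_err:.4f}")
print(f"reduction factor (approx. n_groups): {single_err / avg_err:.1f}")
```

Under these assumptions the noise variance of the group average shrinks roughly in proportion to the number of groups, which is one way independent noise across redundant groups could be averaged out.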