Deep neural networks are widely known for their remarkable effectiveness across various tasks, with the consensus that deeper networks implicitly learn more complex data representations. This paper shows that sufficiently deep networks trained for supervised image classification split into two distinct parts that contribute differently to the resulting data representations. The initial layers create linearly separable representations, while the subsequent layers, which we refer to as \textit{the tunnel}, compress these representations and have a minimal impact on the overall performance. We explore the tunnel's behavior through comprehensive empirical studies, highlighting that it emerges early in the training process. Its depth depends on the relation between the network's capacity and task complexity. Furthermore, we show that the tunnel degrades out-of-distribution generalization and discuss its implications for continual learning.