The learning dynamics of deep neural networks are not well understood. The information bottleneck (IB) theory posited separate fitting and compression phases, but their existence has since been heavily debated. We comprehensively analyze the learning dynamics by tracking, as parameters evolve during training, a layer's ability to reconstruct the input alongside its prediction performance. Using common datasets and architectures such as ResNet and VGG, we empirically show the existence of three phases: (i) near-constant reconstruction loss, (ii) decreasing reconstruction loss, and (iii) increasing reconstruction loss. We also derive an empirically grounded data model and prove the existence of these phases for single-layer networks. Technically, our approach leverages classical complexity analysis. It differs from IB in that it relates the information in intermediate layers to the inputs by measuring reconstruction loss rather than information-theoretic quantities. Our work implies a new best practice for transfer learning: we show empirically that the pre-training of a classifier should stop well before its performance is optimal.
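To make the measurement concrete, the following is a minimal sketch of how a layer's reconstruction loss could be tracked next to its prediction accuracy during training. It is not the procedure from the paper: the linear ridge decoder, the tiny tanh network, the synthetic data, and the names `reconstruction_loss` and `track_phases` are all assumptions chosen for illustration, and the sketch makes no claim that the three phases appear in this toy setting.

```python
import numpy as np


def reconstruction_loss(activations, inputs, ridge=1e-6):
    """MSE of the best linear decoder from layer activations back to inputs.

    Assumption: a ridge-regularized linear decoder stands in for whatever
    decoder is actually trained on the layer's output.
    """
    # Append a bias column so the decoder can learn an offset.
    A = np.hstack([activations, np.ones((len(activations), 1))])
    gram = A.T @ A + ridge * np.eye(A.shape[1])
    decoder = np.linalg.solve(gram, A.T @ inputs)
    return float(np.mean((inputs - A @ decoder) ** 2))


def track_phases(X, y, hidden=4, steps=200, lr=0.5, every=20, seed=0):
    """Train a one-hidden-layer binary classifier with gradient descent.

    Every `every` steps, record (step, reconstruction loss of the hidden
    layer, training accuracy).
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    w2 = rng.normal(scale=0.1, size=hidden)
    history = []
    for step in range(steps + 1):
        H = np.tanh(X @ W1)                   # hidden-layer activations
        p = 1.0 / (1.0 + np.exp(-(H @ w2)))   # predicted P(y = 1)
        if step % every == 0:
            acc = float(np.mean((p > 0.5) == (y == 1)))
            history.append((step, reconstruction_loss(H, X), acc))
        # Gradients of the mean binary cross-entropy loss.
        g_logit = (p - y) / len(y)
        g_w2 = H.T @ g_logit
        g_H = np.outer(g_logit, w2)
        g_W1 = X.T @ (g_H * (1.0 - H ** 2))
        W1 -= lr * g_W1
        w2 -= lr * g_w2
    return history


# Synthetic data (hypothetical): the label depends on two of eight features.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

history = track_phases(X, y)
for step, rec, acc in history:
    print(f"step {step:3d}  reconstruction MSE {rec:.4f}  accuracy {acc:.3f}")
```

Plotting the second column of `history` against the step index gives the kind of reconstruction-loss curve in which the three phases would be looked for. For transfer learning, the same curve could in principle inform when to stop pre-training.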