In this paper, we use matrix information theory as a tool to analyze the dynamics of the information interplay between data representations and classification head vectors in supervised learning. Specifically, inspired by the theory of Neural Collapse, we introduce the matrix mutual information ratio (MIR) and the matrix entropy difference ratio (HDR) to assess the interactions between data representations and classification heads, and we derive the theoretical optimal values of MIR and HDR when Neural Collapse occurs. Our experiments show that MIR and HDR effectively explain many phenomena in neural networks, including standard supervised training dynamics, linear mode connectivity, and the effects of label smoothing and pruning. We further use MIR and HDR to gain insight into the dynamics of grokking, an intriguing phenomenon in supervised training where a model generalizes long after it has fit the training data. Finally, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions between samples and classification heads. The empirical results demonstrate the method's effectiveness: MIR and HDR not only aid in understanding training dynamics but can also enhance the training procedure itself.
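To make the quantities concrete, the following is a minimal sketch of matrix-based entropy and mutual information computed from Gram matrices of representations and labels. It assumes the standard matrix (von Neumann) entropy H(K) = -tr(K' log K') with K' = K / tr(K), and mutual information via the Hadamard product, as in the matrix-based entropy literature; the exact normalizations defining MIR and HDR follow the paper's definitions and are not reproduced here, so the final ratio below is a hypothetical illustration.

```python
import numpy as np

def matrix_entropy(K: np.ndarray) -> float:
    """Von Neumann entropy of a PSD Gram matrix, after trace normalization."""
    K = K / np.trace(K)
    eigvals = np.linalg.eigvalsh(K)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerically-zero eigenvalues
    return float(-np.sum(eigvals * np.log(eigvals)))

def matrix_mutual_information(Kx: np.ndarray, Ky: np.ndarray) -> float:
    """MI(X; Y) = H(Kx) + H(Ky) - H(Kx ∘ Ky), ∘ = Hadamard product.
    The Hadamard product of PSD matrices is PSD (Schur product theorem)."""
    return matrix_entropy(Kx) + matrix_entropy(Ky) - matrix_entropy(Kx * Ky)

# Toy example: Gram matrices from unit-norm representations Z (n x d)
# and one-hot class labels (8 samples, 2 classes).
rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
Kz = Z @ Z.T
Y = np.eye(2)[np.repeat([0, 1], 4)]
Ky = Y @ Y.T

# Hypothetical MIR-style normalization: MI relative to representation entropy.
mir = matrix_mutual_information(Kz, Ky) / matrix_entropy(Kz)
```

In practice one would feed penultimate-layer features and the classification head vectors into the same Gram-matrix construction, then track these ratios over training steps.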