Neural collapse describes the geometry of the activations in the final layer of a deep neural network when it is trained beyond performance plateaus. Open questions include whether neural collapse leads to better generalization and, if so, why and how training beyond the plateau helps. We model neural collapse as an information bottleneck (IB) problem in order to investigate whether a compact representation of this kind exists and to uncover its connection to generalization. We demonstrate that neural collapse leads to good generalization specifically when it approaches an optimal IB solution of the classification problem. Recent research has shown that two deep neural networks independently trained with the same contrastive loss objective are linearly identifiable, meaning that the resulting representations are equivalent up to a matrix transformation. We leverage linear identifiability to approximate an analytical solution of the IB problem. This approximation demonstrates that when class means exhibit $K$-simplex Equiangular Tight Frame (ETF) behavior (e.g., $K=10$ for CIFAR10 and $K=100$ for CIFAR100), they coincide with the critical phase transitions of the corresponding IB problem. The performance plateau occurs once the optimal solution of the IB problem includes all of these phase transitions. We also show that the resulting $K$-simplex ETF can be packed into a $K$-dimensional Gaussian distribution using supervised contrastive learning with a ResNet50 backbone. This geometry suggests that the $K$-simplex ETF learned by supervised contrastive learning approximates the optimal features for source coding. Hence, there is a direct correspondence between optimal IB solutions and generalization in contrastive learning.
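To make the simplex ETF geometry concrete, the following minimal sketch builds $K$ unit-norm class means using the standard simplex ETF construction (a centered, rescaled identity matrix) and checks that every pair of means has the same cosine, $-1/(K-1)$. This is an illustration only, not the procedure used in the paper: the function name `simplex_etf` is hypothetical, and the vectors are constructed directly rather than learned by a network.

```python
import numpy as np


def simplex_etf(num_classes: int) -> np.ndarray:
    """Return a (K, K) matrix whose rows are unit-norm simplex ETF vectors.

    Hypothetical helper for illustration; not taken from the paper.
    """
    k = num_classes
    # Subtract the global mean from the one-hot vectors (rows of the identity).
    centered = np.eye(k) - np.ones((k, k)) / k
    # Each centered row has squared norm (k - 1) / k, so rescale to unit norm.
    return np.sqrt(k / (k - 1)) * centered


K = 10  # assumption: number of classes, e.g. CIFAR10
means = simplex_etf(K)

# Gram matrix of pairwise inner products between the class means.
gram = means @ means.T
off_diag = gram[~np.eye(K, dtype=bool)]

print("all norms equal to 1:", np.allclose(np.diag(gram), 1.0))
print("all pairwise cosines equal to -1/(K-1):",
      np.allclose(off_diag, -1.0 / (K - 1)))
```

In an experiment, one would presumably replace `means` with the centered, normalized class means of the final-layer features and measure how far their Gram matrix is from this ideal one.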