We present a unified theoretical framework connecting the first property of Deep Neural Collapse (DNC1) to the emergence of an implicit low-rank bias in nonlinear networks trained with $L^2$ regularization (weight decay). Our main contributions are threefold. First, we derive a quantitative relation between the Total Cluster Variation (TCV) of intermediate embeddings and the numerical rank of stationary weight matrices: at any critical point, the distance from a weight matrix to the set of rank-$K$ matrices is bounded above by the TCV of the preceding layers' features times a constant inversely proportional to the weight-decay parameter. Second, we prove the global optimality of DNC1 in a constrained representation-cost setting for both feedforward and residual architectures, showing that zero TCV across intermediate layers minimizes the representation cost under natural architectural constraints. Third, we establish a benign-landscape property: for almost every interpolating initialization, there exists a continuous, loss-decreasing path from the initialization to a globally optimal configuration satisfying DNC1. Our theoretical claims are validated empirically: numerical experiments confirm the predicted relations among TCV, singular-value structure, and weight decay. These results indicate that neural collapse and low-rank bias are intimately linked phenomena arising from the optimization geometry induced by weight decay.
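As a schematic illustration of the first contribution (a sketch only: the precise norm and constants are fixed in the body of the paper, and the symbols $W_\ell$, $\mathcal{M}_K$, $\mathrm{TCV}_{<\ell}$, $\lambda$, and $C$ are illustrative names introduced here, not notation taken from the abstract), the bound takes the form
\[
  \operatorname{dist}_F\!\bigl(W_\ell,\,\mathcal{M}_K\bigr)
  \;=\; \min_{\operatorname{rank}(A)\,\le\,K} \lVert W_\ell - A \rVert_F
  \;\le\; \frac{C}{\lambda}\,\mathrm{TCV}_{<\ell},
\]
where $W_\ell$ is a stationary weight matrix, $\mathcal{M}_K$ is the set of matrices of rank at most $K$, $\mathrm{TCV}_{<\ell}$ is the total cluster variation of the features feeding into layer $\ell$, and $\lambda$ is the weight-decay coefficient. By the Eckart–Young theorem, the left-hand side equals $\bigl(\sum_{i>K}\sigma_i(W_\ell)^2\bigr)^{1/2}$, the Frobenius mass of the trailing singular values; this is how the bound links the TCV of earlier-layer features to numerical rank, since shrinking TCV or increasing weight decay forces the trailing spectrum toward zero.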