Optimizer choice matters for the emergence of Neural Collapse

Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. NC is typically quantified through NC metrics, which are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. We also show that, when training with SGD, momentum accelerates NC beyond its effect on the convergence of the training loss; this is the first result concerning momentum in the context of NC. Finally, we conduct extensive experiments comprising 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, which confirm our theoretical results. This work provides the first theoretical explanation for the optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.
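The three update rules compared in the abstract differ only in where the weight-decay term enters the step. The following is a minimal sketch of a single step of each rule, assuming the standard textbook forms of SGD with L2-style (coupled) weight decay and of SignGD viewed as Adam/AdamW without moment averaging. The function names, the toy vectors, and the hyperparameter values are illustrative assumptions and are not taken from the paper; the sketch does not compute NC0 or any other NC metric.

```python
import numpy as np


def sgd_step(w, grad, lr, wd):
    """SGD with coupled weight decay: the decay term wd * w is added to the gradient."""
    return w - lr * (grad + wd * w)


def signgd_coupled_step(w, grad, lr, wd):
    """SignGD with coupled weight decay (Adam-style).

    The decay term is added to the gradient BEFORE the sign is taken,
    so it only influences the direction of the step, not its magnitude.
    """
    return w - lr * np.sign(grad + wd * w)


def signgd_decoupled_step(w, grad, lr, wd):
    """SignGD with decoupled weight decay (AdamW-style).

    The sign is applied to the raw gradient only; the weights are
    shrunk separately, in proportion to their current value.
    """
    return w - lr * np.sign(grad) - lr * wd * w


# Toy values (hypothetical) chosen so that the three rules disagree.
w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, 0.1, -0.2])
lr, wd = 0.1, 0.5

w_sgd = sgd_step(w, grad, lr, wd)
w_coupled = signgd_coupled_step(w, grad, lr, wd)
w_decoupled = signgd_decoupled_step(w, grad, lr, wd)

print("SGD:              ", w_sgd)
print("SignGD, coupled:  ", w_coupled)
print("SignGD, decoupled:", w_decoupled)
```

In the coupled variant the decay term can flip the sign of a coordinate's step (here the second and third coordinates), whereas in the decoupled variant it acts as a separate multiplicative shrinkage of the weights. This is only meant to make the structural difference between the two coupling schemes concrete.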
