Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet the causal role of either phenomenon remains unclear: are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics. We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed, grokking-like generalization, even in architectures and on datasets where grokking does not typically occur. Furthermore, we show theoretically that neural collapse implies relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and they demonstrate how grokking can serve as a powerful probe for isolating the geometric underpinnings of generalization.