Neural collapse ($\mathcal{NC}$) is a phenomenon observed in classification tasks where top-layer representations collapse into their class means, which in turn become equinorm, equiangular, and aligned with the classifiers. These behaviors, associated with generalization and robustness, are expected to manifest only under specific conditions: the model is trained to zero loss on noise-free labels drawn from balanced classes, whose number does not exceed the model's hidden dimension. Recent studies have explored $\mathcal{NC}$ in the absence of one or more of these conditions, seeking to extend and capitalize on the benefits of the associated ideal geometries. Language modeling presents a curious frontier, as \textit{training by token prediction} constitutes a classification task in which none of these conditions hold: the vocabulary is imbalanced and exceeds the embedding dimension; different tokens might correspond to similar contextual embeddings; and large language models (LLMs) in particular are typically trained for only a few epochs. This paper empirically investigates the impact of scaling the architectures and training of causal language models (CLMs) on their progression towards $\mathcal{NC}$. We find that the $\mathcal{NC}$ properties that develop with scale (and regularization) are linked to generalization. Moreover, there is evidence of some relationship between $\mathcal{NC}$ and generalization that is independent of scale. Our work thereby underscores the generality of $\mathcal{NC}$, extending it to the novel and more challenging setting of language modeling. Downstream, we seek to inspire further research on the phenomenon to deepen our understanding of LLMs, and of neural networks at large, and to improve existing architectures based on $\mathcal{NC}$-related properties. Our code is hosted on GitHub at https://github.com/rhubarbwu/linguistic-collapse.
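As an illustrative sketch (not the paper's own code), the core $\mathcal{NC}$ diagnostics named above can be computed directly from penultimate-layer features: within-class variability relative to between-class variability (collapse to class means), the spread of class-mean norms (equinorm), and the spread of pairwise cosines between class means (equiangularity, approaching $-1/(K-1)$ for a simplex ETF). The class count $K$, dimension $d$, and the synthetic near-collapsed features below are assumptions chosen purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): K classes, d-dim features, d >= K.
K, d, n_per_class = 4, 16, 50

# Simulate near-collapsed features: tight clusters around class means that
# form a simplex ETF direction set (centered standard-basis vectors).
E = np.eye(K, d)                                   # first K basis vectors
means = E - E.mean(axis=0)                         # pairwise cosine -1/(K-1)
feats = means[:, None, :] + 0.01 * rng.normal(size=(K, n_per_class, d))

global_mean = feats.reshape(-1, d).mean(axis=0)
class_means = feats.mean(axis=1)                   # (K, d)
centered = class_means - global_mean               # recentred class means

# NC1: within-class variability relative to between-class variability.
within = np.mean([np.trace(np.cov(feats[k].T)) for k in range(K)])
between = np.trace(np.cov(centered.T))
nc1 = within / between

# NC2 (equinorm): coefficient of variation of class-mean norms.
norms = np.linalg.norm(centered, axis=1)
equinorm_cv = norms.std() / norms.mean()

# NC2 (equiangular): spread of pairwise cosines between class means.
unit = centered / norms[:, None]
cos = unit @ unit.T
off_diag = cos[~np.eye(K, dtype=bool)]
equiangle_std = off_diag.std()

print(nc1, equinorm_cv, equiangle_std, off_diag.mean())
```

On these synthetic features all three quantities are near zero, and the mean pairwise cosine sits near $-1/(K-1)$; on real CLM token embeddings, imbalanced and numerous classes make such clean geometry far from guaranteed, which is the question the paper probes.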