Neural collapse ($\mathcal{NC}$) is a phenomenon observed in classification tasks where top-layer representations collapse into their class means, which in turn become equinorm, equiangular, and aligned with the classifiers. These behaviors, associated with generalization and robustness, are expected to manifest only under specific conditions: the model is trained to zero loss on noise-free labels drawn from balanced classes, whose number does not exceed the model's hidden dimension. Recent studies have explored $\mathcal{NC}$ in the absence of one or more of these conditions, seeking to extend and capitalize on the benefits of the associated ideal geometries. Language modeling presents a curious frontier, as \textit{training by token prediction} constitutes a classification task in which none of these conditions hold: the vocabulary is imbalanced and exceeds the embedding dimension; different tokens might correspond to similar contextual embeddings; and large language models (LLMs) in particular are typically trained for only a few epochs. This paper empirically investigates the impact of scaling the architectures and training of causal language models (CLMs) on their progression towards $\mathcal{NC}$. We find that the $\mathcal{NC}$ properties that develop with scale (and regularization) are linked to generalization. Moreover, there is evidence of some relationship between $\mathcal{NC}$ and generalization that is independent of scale. Our work thereby underscores the generality of $\mathcal{NC}$, extending it to the novel and more challenging setting of language modeling. Downstream, we seek to inspire further research on the phenomenon to deepen our understanding of LLMs, and of neural networks at large, and to improve existing architectures based on $\mathcal{NC}$-related properties. Our code is hosted on GitHub at https://github.com/rhubarbwu/linguistic-collapse.
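As an illustrative sketch (not the paper's own code), the core $\mathcal{NC}$ diagnostics named above can be computed directly from penultimate-layer features: within-class variability relative to between-class variability (collapse to class means), the spread of class-mean norms (equinorm), and the spread of pairwise cosines between class means (equiangularity, approaching $-1/(K-1)$ for a simplex ETF). The class count $K$, dimension $d$, and the synthetic near-collapsed features below are assumptions chosen purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): K classes, d-dim features, d >= K.
K, d, n_per_class = 4, 16, 50

# Simulate near-collapsed features: tight clusters around class means that
# form a simplex ETF direction set (centered standard-basis vectors).
E = np.eye(K, d)                                   # first K basis vectors
means = E - E.mean(axis=0)                         # pairwise cosine -1/(K-1)
feats = means[:, None, :] + 0.01 * rng.normal(size=(K, n_per_class, d))

global_mean = feats.reshape(-1, d).mean(axis=0)
class_means = feats.mean(axis=1)                   # (K, d)
centered = class_means - global_mean               # recentred class means

# NC1: within-class variability relative to between-class variability.
within = np.mean([np.trace(np.cov(feats[k].T)) for k in range(K)])
between = np.trace(np.cov(centered.T))
nc1 = within / between

# NC2 (equinorm): coefficient of variation of class-mean norms.
norms = np.linalg.norm(centered, axis=1)
equinorm_cv = norms.std() / norms.mean()

# NC2 (equiangular): spread of pairwise cosines between class means.
unit = centered / norms[:, None]
cos = unit @ unit.T
off_diag = cos[~np.eye(K, dtype=bool)]
equiangle_std = off_diag.std()

print(nc1, equinorm_cv, equiangle_std, off_diag.mean())
```

On these synthetic features all three quantities are near zero, and the mean pairwise cosine sits near $-1/(K-1)$; on real CLM token embeddings, imbalanced and numerous classes make such clean geometry far from guaranteed, which is the question the paper probes.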