Spontaneous symmetry breaking in statistical mechanics primarily occurs during phase transitions in the thermodynamic limit, where the Hamiltonian preserves inversion symmetry yet the low-temperature free energy exhibits reduced symmetry. Herein, we demonstrate the emergence of spontaneous symmetry breaking in natural language processing (NLP) models during both pre-training and fine-tuning, even under deterministic dynamics and within a finite training architecture. The phenomenon occurs at the level of individual attention heads, scales down to small subsets of their nodes, and remains valid at the single-node level, where each node acquires the capacity to learn a limited set of tokens after pre-training, or of labels after fine-tuning on a specific classification task. As the number of nodes increases, a crossover in learning ability occurs, governed by the tradeoff between a decrease driven by random guessing among an increased number of possible outputs, and an enhancement driven by nodal cooperation, which exceeds the sum of the individual nodal capabilities. In contrast to spin-glass systems, where a microscopic state of frozen spins cannot be directly linked to the goal of free-energy minimization, each nodal function in this framework contributes explicitly to the global network task and can be upper-bounded using convex hull analysis. The results are demonstrated using a BERT-6 architecture pre-trained on the Wikipedia dataset and fine-tuned on the FewRel classification task.
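To make the node-counting crossover concrete, the following minimal Python sketch (not the authors' code) illustrates how per-node learning capacity could be compared against the random-guess baseline 1/C as the node subset grows. The synthetic `activations` and `labels`, and the linear readout, are assumptions introduced here for illustration, standing in for the activations of a fine-tuned attention head and FewRel-style class labels.

```python
# Minimal sketch (not from the paper): probing how classification accuracy
# scales with the number of attention-head nodes used as features.
# Assumptions: `activations` is a hypothetical stand-in for per-node outputs
# of a fine-tuned head (here synthetic data), `labels` for FewRel-style
# labels; the linear readout is an illustrative choice, not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_nodes, n_labels = 2000, 32, 10

# Synthetic stand-in: each node carries a weak, partly redundant label signal.
labels = rng.integers(0, n_labels, size=n_samples)
signal = np.eye(n_labels)[labels] @ rng.normal(size=(n_labels, n_nodes))
activations = signal + 2.0 * rng.normal(size=(n_samples, n_nodes))

baseline = 1.0 / n_labels  # random-guess accuracy among n_labels outputs

for k in (1, 2, 4, 8, 16, 32):  # growing node subsets
    X = activations[:, :k]
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{k:2d} nodes: accuracy {acc:.3f} (random guess {baseline:.3f})")
```

Under these assumptions, a single node performs near the random-guess baseline, while larger subsets gain accuracy faster than the individual contributions would suggest, mirroring the cooperation-versus-baseline tradeoff described above.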