This work, based on Random Matrix Theory (RMT), introduces a novel early-stopping strategy for Transformer training dynamics. Using a Power-Law (PL) fit to the spectra of Transformer attention matrices as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. Empirically, we observe that the spectral density of the shallow self-attention matrix $V$ consistently evolves into a heavy-tailed distribution. Crucially, we propose two consistent, validation-set-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong agreement between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer training.
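To make the probe concrete, the following is a minimal sketch, not the paper's implementation, of fitting a power-law exponent to the empirical spectral density (ESD) of a weight matrix. It assumes NumPy and a Hill-type maximum-likelihood estimator over the largest eigenvalues; the function name `esd_power_law_alpha` and the matrix `attention_V` are hypothetical.

```python
# Minimal sketch: power-law exponent of a weight matrix's empirical
# spectral density (ESD), as a training-dynamics probe. Assumes NumPy only.
import numpy as np

def esd_power_law_alpha(W: np.ndarray, tail_fraction: float = 0.1) -> float:
    """Estimate the power-law exponent alpha of the ESD tail of W."""
    # Eigenvalues of the correlation matrix W^T W / n form the ESD.
    eigvals = np.linalg.eigvalsh(W.T @ W / W.shape[0])
    eigvals = np.sort(eigvals[eigvals > 1e-12])
    # Keep the largest `tail_fraction` of eigenvalues as the tail sample.
    k = max(int(tail_fraction * len(eigvals)), 2)
    tail = eigvals[-k:]
    x_min = tail[0]
    # Continuous-MLE / Hill-type estimate: alpha = 1 + k / sum(log(x_i / x_min)).
    return 1.0 + k / np.sum(np.log(tail / x_min))

# Example: evaluate alpha on a (randomly initialized) stand-in for a V
# projection matrix; in practice one would track alpha across checkpoints
# and watch for it to drop into and stabilize within a heavy-tailed range.
rng = np.random.default_rng(0)
attention_V = rng.standard_normal((512, 512)) / np.sqrt(512)
print(f"alpha = {esd_power_law_alpha(attention_V):.2f}")
```

In this sketch, a decreasing then plateauing alpha over training checkpoints would correspond, under the staging described above, to the transition from structural exploration to heavy-tailed structure stabilization; the specific thresholds used in the paper are not reproduced here.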