We present the first comprehensive analysis of how massive activations develop throughout transformer training, using the Pythia model family as our testbed, and we publicly release our full dataset to support further research. Through systematic analysis of multiple model sizes across training checkpoints, we demonstrate that massive activation emergence follows highly predictable mathematical patterns that can be accurately modeled by an exponentially modulated logarithmic function with five key parameters. Additionally, we develop a machine learning framework that predicts these parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict, and potentially control, key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. In short, the emergence of massive activations is governed by model design and can be anticipated before training begins. Code is available at https://github.com/Aimpoint-Digital/massive-activations-fork.
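As a concrete illustration, the sketch below fits a five-parameter curve of this family to synthetic checkpoint data with SciPy. The functional form shown (an exponential envelope modulating a logarithmic growth term, plus a steady-state offset) and all parameter names and values are illustrative assumptions; the abstract names the function class but not its exact parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical five-parameter exponentially modulated logarithmic form:
# an exponential envelope a*exp(-b*t) damping a logarithmic growth term
# log(c*t + d), plus a steady-state offset e. The paper's exact
# parameterization may differ; this is an illustrative assumption.
def emergence_curve(t, a, b, c, d, e):
    return a * np.exp(-b * t) * np.log(c * t + d) + e

# Synthetic stand-in for measured top-activation magnitudes across
# Pythia-style training checkpoints (real values: see the released dataset).
steps = np.linspace(100, 143_000, 60)
rng = np.random.default_rng(0)
observed = emergence_curve(steps, 900.0, 3e-5, 0.01, 1.0, 40.0)
observed += rng.normal(0.0, 5.0, size=observed.shape)

# Recover the five parameters from a rough positive initial guess;
# nonnegativity bounds keep the log argument valid during optimization.
p0 = [800.0, 1e-5, 0.02, 1.0, 30.0]
params, _ = curve_fit(emergence_curve, steps, observed, p0=p0,
                      bounds=(0.0, np.inf))
print(dict(zip("abcde", params)))
```

Under this assumed form, the curve rises sharply, peaks, and relaxes toward the offset e, which would play the role of the steady-state behavior that the framework predicts most accurately from architecture alone.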