The impact of randomness on model training is poorly understood. How do differences in data order and initialization actually manifest in the model, such that some training runs outperform others or converge faster? Furthermore, how can we interpret the resulting training dynamics and the phase transitions that characterize different trajectories? To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the $L_2$ norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM) over the resulting sequences of metrics. The HMM represents training as a stochastic process of transitions between latent states, providing an intuitive overview of significant changes during training. Using our method, we produce a low-dimensional, discrete representation of training dynamics on grokking tasks, image classification, and masked language modeling. We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
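The metric-collection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `weight_metrics` and `simulate_run` are hypothetical names, and the "training run" is a toy stand-in (noisy weight decay) used only to produce one metric sequence per random seed.

```python
import math
import random
import statistics


def weight_metrics(weights):
    """Return (L2 norm, mean, variance) for a flat list of weights."""
    l2_norm = math.sqrt(sum(w * w for w in weights))
    mean = statistics.fmean(weights)
    variance = statistics.pvariance(weights)  # population variance
    return (l2_norm, mean, variance)


def simulate_run(seed, n_weights=50, n_checkpoints=5, lr=0.1):
    """Toy stand-in for one training run (assumption: not a real model).

    Weights start from a seeded Gaussian initialization and shrink toward
    zero with a little noise; metrics are recorded at every checkpoint.
    """
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_weights)]
    trajectory = []
    for _ in range(n_checkpoints):
        trajectory.append(weight_metrics(weights))
        weights = [w - lr * w + rng.gauss(0.0, 0.01) for w in weights]
    return trajectory


# One sequence of metric vectors per random seed.
sequences = {seed: simulate_run(seed) for seed in range(3)}
```

Each entry of `sequences` is a list of metric vectors, one per checkpoint. Sequences of this shape are what an HMM would then be fit over, for example with an off-the-shelf Gaussian HMM implementation; the choice of library and the number of latent states are left open here because the text above does not specify them.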