We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters. Towards this end, we introduce natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which capture the directional nature of optimization in neural networks: when there is redundancy, and when there is exploration. We use them to reveal the inherent nuance of, and interplay between, various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the regularizing effect of scale on the directional nature of trajectories, and as a by-product, we also observe an intriguing heterogeneity in the Q, K, V dynamics of the middle attention layers in LLMs, which is homogenized by scale. Importantly, we put the significant directional redundancy we observe to the test by demonstrating that, partway into training, updating only the scalar batchnorm parameters matches the performance of training the entire network, thus exhibiting the potential of hybrid optimization schemes geared towards efficiency.
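To make these two ideas concrete, below is a minimal Python/PyTorch sketch under stated assumptions. The directional-similarity matrix computed here (pairwise cosines of flattened parameter checkpoints) is one natural instantiation of trajectory complexity, not necessarily the paper's exact measure, and `flatten_params`, `trajectory_similarity`, and `freeze_all_but_batchnorm` are hypothetical helper names introduced purely for illustration of the batchnorm-only training regime.

```python
# A minimal sketch, assuming PyTorch, of (i) one natural directional-complexity
# measure over a trajectory of parameter checkpoints, and (ii) freezing all
# parameters except the scalar batchnorm affine parameters partway into
# training. Helper names are hypothetical, not the paper's API.
import torch


def flatten_params(model: torch.nn.Module) -> torch.Tensor:
    """Concatenate all model parameters into a single 1-D vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])


def trajectory_similarity(checkpoints: list) -> torch.Tensor:
    """Pairwise cosine similarities between checkpointed parameter vectors.

    checkpoints: list of 1-D tensors (one per saved training step).
    Returns a (T, T) matrix whose (i, j) entry is cos(theta_i, theta_j);
    high off-diagonal values suggest directional redundancy, low values
    suggest directional exploration.
    """
    X = torch.stack(checkpoints)                         # (T, D)
    X = X / X.norm(dim=1, keepdim=True).clamp_min(1e-12)  # unit-normalize rows
    return X @ X.T                                       # cosine matrix


def freeze_all_but_batchnorm(model: torch.nn.Module) -> None:
    """Freeze every parameter, then re-enable gradients only for the
    scalar batchnorm affine parameters (weight and bias)."""
    for p in model.parameters():
        p.requires_grad_(False)
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)
    for m in model.modules():
        if isinstance(m, bn_types):
            if m.weight is not None:
                m.weight.requires_grad_(True)
            if m.bias is not None:
                m.bias.requires_grad_(True)
```

In this sketch, one would call `flatten_params` at each saved checkpoint to build the list passed to `trajectory_similarity`, and invoke `freeze_all_but_batchnorm` at the chosen switch-over point in training; the optimizer then only updates the remaining trainable scalars.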