Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
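As a concrete illustration of the kind of invariant that is already well understood in the ReLU case, consider a single neuron f(x) = a · relu(w · x) trained on a squared loss. Under gradient flow the quantity a² − ‖w‖² stays constant, because a · ∂L/∂a = w · ∂L/∂w at every point. The sketch below is not taken from this work's experiments: the toy data, the initial values, and the function names (`grads`, `invariant`, `run`) are all assumptions made for illustration, and gradient flow is only approximated by gradient descent with a small step size, so the invariant drifts by a term of order (step size)² per step.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def grads(a, w, data):
    """Gradient of the mean of 0.5 * (f(x) - y)^2 for f(x) = a * relu(w . x)."""
    ga = 0.0
    gw = [0.0] * len(w)
    for x, y in data:
        pre = sum(wi * xi for wi, xi in zip(w, x))
        r = a * relu(pre) - y          # residual f(x) - y
        ga += r * relu(pre)            # dL/da
        if pre > 0.0:                  # ReLU passes gradient only when active
            for i, xi in enumerate(x):
                gw[i] += r * a * xi    # dL/dw_i
    n = len(data)
    return ga / n, [g / n for g in gw]


def invariant(a, w):
    """Balance quantity a^2 - ||w||^2, conserved under exact gradient flow."""
    return a * a - sum(wi * wi for wi in w)


def run(steps=20000, lr=1e-4):
    # Hypothetical toy data: (input vector, target) pairs.
    data = [((1.0, 0.5), 1.0), ((0.2, -1.0), 0.0), ((-0.7, 0.3), 0.5)]
    a, w = 0.8, [0.3, -0.4]            # arbitrary initial parameters
    q_start = invariant(a, w)
    for _ in range(steps):
        ga, gw = grads(a, w, data)
        a -= lr * ga
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
    return q_start, invariant(a, w), a, w


if __name__ == "__main__":
    q_start, q_end, a, w = run()
    print("invariant before:", q_start)
    print("invariant after: ", q_end)
    print("trained parameters:", a, w)
```

The parameters move during training while the balance quantity stays nearly fixed; shrinking `lr` (with proportionally more steps) should tighten the agreement, which is the discrete-time counterpart of the conservation law holding exactly in the flow limit.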