Large-scale neural network training increasingly relies on matrix-aware optimizers that exploit the structure of weight parameters beyond element-wise adaptation. However, existing matrix-aware methods such as Muon have an underappreciated vulnerability: their core operation, Newton-Schulz iteration, depends critically on input conditioning, yet the raw momentum matrices exhibit severe coordinate-wise scale heterogeneity. In this paper, we first verify this scale heterogeneity through a chi-square uniformity test, showing that intra-matrix scale imbalance is prevalent across Transformer layers and that coordinate whitening effectively corrects it. Motivated by this finding, we propose Zeta, a dual whitening optimizer that applies coordinate whitening and spectral whitening in a strictly ordered pipeline. The ordering is not a tunable choice but follows from a mathematical dependency: coordinate whitening establishes the statistical isotropy that spectral whitening requires to function reliably. We further prove that this dual pipeline strictly reduces orthogonalization error relative to pure spectral methods by improving the condition number of the input. Empirically, Zeta matches or surpasses strong baselines across language modeling (0.6B to 8B parameters), mixture-of-experts architectures, and vision tasks, demonstrating that resolving scale imbalance before orthogonalization leads to faster convergence and better generalization. Code is available at https://github.com/AIGCodeOS/aigcode_zeta_optimizer.
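To make the ordered pipeline concrete, the following is a minimal NumPy sketch of the idea, not the released Zeta implementation. It rests on assumptions the abstract does not confirm: coordinate whitening is taken to be an Adam-style division by the square root of a per-coordinate second-moment estimate, spectral whitening is taken to be the classic cubic Newton-Schulz iteration, and the function names (`coordinate_whiten`, `newton_schulz`, `dual_whiten`, `orthogonality_error`) and the synthetic data are hypothetical.

```python
import numpy as np


def coordinate_whiten(momentum, second_moment, eps=1e-8):
    """Element-wise rescaling (ASSUMED Adam-style form, not taken from the paper)."""
    return momentum / (np.sqrt(second_moment) + eps)


def newton_schulz(matrix, steps=10, eps=1e-8):
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor."""
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    x = matrix / (np.linalg.norm(matrix) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work with the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def dual_whiten(momentum, second_moment, steps=10):
    """Coordinate whitening first, spectral whitening second."""
    return newton_schulz(coordinate_whiten(momentum, second_moment), steps=steps)


def orthogonality_error(x):
    """Frobenius distance between the smaller Gram matrix of x and the identity."""
    gram = x @ x.T if x.shape[0] <= x.shape[1] else x.T @ x
    return np.linalg.norm(gram - np.eye(gram.shape[0]))


# Synthetic momentum whose columns span three orders of magnitude in scale.
rng = np.random.default_rng(0)
base = rng.standard_normal((32, 64))
scales = 10.0 ** rng.uniform(-3.0, 0.0, size=64)
momentum = base * scales
# Stand-in for a running second-moment estimate: the true per-coordinate variance.
second_moment = np.broadcast_to(scales ** 2, momentum.shape)

raw_error = orthogonality_error(newton_schulz(momentum))
dual_error = orthogonality_error(dual_whiten(momentum, second_moment))
print(f"Newton-Schulz only: {raw_error:.4f}")
print(f"coordinate whitening + Newton-Schulz: {dual_error:.4f}")
```

In this toy setting, the heterogeneous column scales leave the raw matrix with very small normalized singular values. A fixed budget of Newton-Schulz steps does not drive those values to 1, whereas the rescaled matrix is far better conditioned and ends up closer to orthogonal under the same budget.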