Hyper-Connections (HC) generalize residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately project these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff--von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.
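To illustrate the core idea, here is a minimal sketch of how a doubly stochastic matrix can be built as a convex combination of permutation matrices, as the Birkhoff--von Neumann theorem guarantees. This is not the authors' implementation; the function name `mhc_lite_matrix`, the softmax parameterization of the convex weights, and the particular set of permutations are illustrative assumptions.

```python
import numpy as np

def mhc_lite_matrix(logits, perms):
    """Illustrative sketch (not the paper's code): build an exactly
    doubly stochastic matrix as a convex combination of permutation
    matrices, per the Birkhoff--von Neumann theorem.

    logits : array of shape (k,), unnormalized weights (learnable in practice)
    perms  : list of k index lists, each a permutation of range(n)
    """
    # Softmax turns the logits into convex-combination weights
    # (nonnegative, summing to 1).
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    n = len(perms[0])
    P = np.zeros((n, n))
    for w, perm in zip(weights, perms):
        # np.eye(n)[perm] selects rows of the identity, yielding the
        # permutation matrix for `perm`.
        P += w * np.eye(n)[perm]
    return P
```

Because each term is a permutation matrix (row and column sums exactly 1) and the weights form a convex combination, the result has exact unit row and column sums with no iterative normalization, which is the property finite SK iterations only approximate.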