The success of Hyper-Connections (HC) in neural networks has also highlighted issues with training instability and restricted scalability. Manifold-Constrained Hyper-Connections (mHC) mitigates these challenges by projecting the residual connection space onto the Birkhoff polytope; however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exactly doubly stochastic residual matrices; 2) it incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity, where $n$ is the width of the residual stream and $C$ is the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but it suffers from a factorial explosion in parameter complexity, $\mathcal{O} \left( nC \cdot n! \right)$. To address both challenges, we propose \textbf{KromHC}, which uses the \underline{Kro}necker products of smaller doubly stochastic matrices to parametrize the residual matrix in \underline{mHC}. By enforcing manifold constraints on the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants while requiring significantly fewer trainable parameters. The code is available at \texttt{https://github.com/wz1119/KromHC}.
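The property the construction relies on, that a Kronecker product of doubly stochastic matrices is itself doubly stochastic, can be checked numerically. The following is a minimal sketch, not the released implementation: the $2 \times 2$ factor size, the softmax-weighted combination of permutation matrices used to build each factor, and all function names (`birkhoff_matrix`, `kron`, `is_doubly_stochastic`) are assumptions made for illustration.

```python
import math
from itertools import permutations


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def birkhoff_matrix(k, logits):
    """Hypothetical factor parametrization: a convex combination of all
    k x k permutation matrices (Birkhoff-von-Neumann), which is doubly
    stochastic by construction. Needs k! logits, so keep k small."""
    perms = list(permutations(range(k)))
    assert len(logits) == len(perms)
    weights = softmax(logits)
    mat = [[0.0] * k for _ in range(k)]
    for weight, perm in zip(weights, perms):
        for row, col in enumerate(perm):
            mat[row][col] += weight
    return mat


def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def is_doubly_stochastic(mat, tol=1e-9):
    """Check non-negativity and unit row and column sums of a square matrix."""
    n = len(mat)
    nonneg = all(x >= 0.0 for row in mat for x in row)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in mat)
    cols_ok = all(
        abs(sum(mat[i][j] for i in range(n)) - 1.0) < tol for j in range(n)
    )
    return nonneg and rows_ok and cols_ok


# Two small doubly stochastic factors (arbitrary illustrative logits).
u1 = birkhoff_matrix(2, [0.3, -1.2])
u2 = birkhoff_matrix(2, [2.0, 0.5])

# Their Kronecker product is a 4 x 4 matrix for a residual stream of width n = 4.
h_res = kron(u1, u2)
print(is_doubly_stochastic(h_res))
```

Because each factor has exact unit row and column sums, the product inherits them without any iterative normalization, and the factorial cost of enumerating permutations applies only to the small factor size rather than to $n$.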