The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row should condense its expert matrix into a single representative vector, so that its dot product with a token better reflects token-expert affinity. However, there exist no design principles that enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction is the most expressive single-vector description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm: a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
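The "Power-then-Retract" idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes each expert is represented by a single weight matrix `W` of shape `(m, d)` acting on `d`-dimensional tokens, that the power step multiplies the router row by `W.T @ W`, and that the retraction is a projection onto the unit sphere. The function names `power_then_retract` and `alignment` are hypothetical.

```python
import numpy as np

def power_then_retract(router, experts, steps=1):
    """Apply `steps` rounds of power iteration plus retraction to each router row.

    router:  array of shape (E, d), one row per expert (assumed layout).
    experts: sequence of E arrays of shape (m, d), one matrix per expert (assumed).
    """
    new_rows = []
    for r, W in zip(router, experts):
        for _ in range(steps):
            r = W.T @ (W @ r)            # power step with W^T W (assumption)
            r = r / np.linalg.norm(r)    # retraction: renormalize to unit norm
        new_rows.append(r)
    return np.stack(new_rows)

def alignment(router, experts):
    """Absolute cosine between each router row and its expert's top right singular vector."""
    scores = []
    for r, W in zip(router, experts):
        _, _, vt = np.linalg.svd(W, full_matrices=False)
        scores.append(abs(r @ vt[0]) / np.linalg.norm(r))
    return np.array(scores)

# Small demonstration on random data (illustrative only).
rng = np.random.default_rng(0)
experts = [rng.standard_normal((16, 8)) for _ in range(4)]
router = rng.standard_normal((4, 8))
print("alignment before:", alignment(router, experts))
print("alignment after: ", alignment(power_then_retract(router, experts, steps=50), experts))
```

Under these assumptions, repeated power steps amplify the component of each row along the top right singular vector of `W`, while the retraction keeps the row norm fixed so the iterate neither blows up nor vanishes. How quickly the alignment approaches 1 depends on the gap between the two largest singular values of each expert.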