Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

Sparse Mixture-of-Experts (SMoE) models make it possible to scale language models efficiently, but training them remains challenging: routing can collapse onto a few experts, and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study the mechanism by which routing decisions in SMoEs are formed. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights that process the token receive gradients along the same input direction, differing only in scalar coefficients. Matched router-expert directions therefore accumulate the same history of routed tokens. This theoretical coupling also appears empirically in routing dynamics. In a 1B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze how auxiliary load balancing affects the router-expert geometric coupling. We show that such losses break this structure by spreading input-directed gradients across router weights, which makes distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form an assignment geometry that supports an effective division of labor.
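
The online K-Means router described above can be illustrated with a minimal sketch. This is not the paper's implementation. The class name `OnlineKMeansRouter`, the top-1 assignment, the random Gaussian initialization, and the use of a cumulative (rather than exponentially decayed) running mean are all assumptions made for illustration, and the sketch makes no attempt to reproduce the reported load-balance or perplexity results.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class OnlineKMeansRouter:
    """Hypothetical top-1 router: one centroid per expert, no learned weights."""

    def __init__(self, num_experts, dim, seed=0):
        rng = random.Random(seed)
        # Assumption: centroids start as random Gaussian vectors.
        self.centroids = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)
        ]
        # Number of tokens routed to each expert so far.
        self.counts = [0] * num_experts

    def route(self, h):
        """Return the index of the expert whose centroid is most similar to h."""
        sims = [cosine(h, c) for c in self.centroids]
        return max(range(len(sims)), key=sims.__getitem__)

    def update(self, expert, h):
        """Fold h into the running average of hidden states seen by `expert`."""
        self.counts[expert] += 1
        n = self.counts[expert]
        c = self.centroids[expert]
        # Incremental mean: c <- c + (h - c) / n.
        self.centroids[expert] = [ci + (hi - ci) / n for ci, hi in zip(c, h)]

    def __call__(self, h):
        expert = self.route(h)
        self.update(expert, h)
        return expert


# Usage: two experts in a 2-D hidden space, with hand-set starting centroids.
router = OnlineKMeansRouter(num_experts=2, dim=2)
router.centroids = [[1.0, 0.0], [0.0, 1.0]]
assignments = [router(h) for h in [[0.9, 0.1], [0.1, 0.8], [1.0, 0.2]]]
```

Because each centroid is simply the mean of the hidden states routed to its expert, the routing direction is, by construction, a summary of that expert's token history. This is the property the sketch is meant to make concrete.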
