Mixture-of-Experts (MoE) models lack explicit constraints to ensure that the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose the expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert, and (2) each proxy token must elicit stronger activation from its corresponding expert than from any other expert. Together, these constraints ensure that each router embedding faithfully represents its expert's capability, while each expert specializes in the tokens actually routed to it. The ERC loss is computationally efficient, operating on only n^2 activations, where n is the number of experts, a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
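To make the two constraints concrete, the following is a minimal sketch of an ERC-style loss over the n x n activation matrix. It assumes (a) each expert exposes an intermediate hidden activation whose norm serves as the scalar "activation," and (b) the two constraints are enforced via row-wise and column-wise cross-entropy with the diagonal as the target; the names `Expert`, `erc_loss`, and `noise_std` are illustrative choices, not the authors' implementation.

```python
# Illustrative ERC loss sketch (PyTorch); the activation measure and loss form
# are assumptions for demonstration, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Toy FFN expert that also returns its internal (hidden) activation."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        h = F.gelu(self.up(x))        # internal activation
        return self.down(h), h


def erc_loss(router_emb: torch.Tensor, experts: list, noise_std: float = 0.01):
    """ERC-style loss over n^2 activations.

    router_emb: (n, d) matrix whose i-th row is expert i's router embedding,
                treated as a proxy token for the tokens routed to expert i.
    """
    n = router_emb.size(0)
    # Perturb the proxy tokens before feeding them through the experts.
    proxies = router_emb + noise_std * torch.randn_like(router_emb)

    # A[i, j]: scalar activation of expert i on proxy token j
    # (here: L2 norm of the expert's hidden activation, one possible choice).
    rows = []
    for expert in experts:
        _, hidden = expert(proxies)           # (n, d_hidden)
        rows.append(hidden.norm(dim=-1))      # (n,)
    A = torch.stack(rows, dim=0)              # (n, n)

    targets = torch.arange(n, device=A.device)
    # (1) Each expert is most activated by its own proxy token (row-wise).
    loss_expert = F.cross_entropy(A, targets)
    # (2) Each proxy token most activates its own expert (column-wise).
    loss_proxy = F.cross_entropy(A.t(), targets)
    return 0.5 * (loss_expert + loss_proxy)
```

Because the loss touches only the n x n expert-proxy activation matrix, its cost is fixed by the number of experts and does not grow with batch size, in line with the efficiency claim above.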