Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, the massive resident expert pool crowds out high-bandwidth memory (HBM) headroom for the KV cache. We formalize this phenomenon as reuse fragmentation; it pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion for when MoE is structurally disadvantaged relative to a quality-matched dense model. The criterion combines sparsity ($s$), the fraction of parameters activated per token, with the quality-equivalence factor ($q$), the size multiplier a dense model requires to match MoE performance. Our evaluation across frontier models, including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C, demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, it yields a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable. Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
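As a first-order sketch of why a $qs$-type criterion arises (illustrative only; it ignores attention and KV-cache traffic, and the precise form depends on the serving configuration): let $N$ be the MoE model's total parameter count, so the quality-matched dense model has $qN$ parameters and the MoE activates $sN$ per token. In bandwidth-bound decode, the dense model amortizes each weight read over an effective microbatch of $b$ tokens, while fragmented routing leaves each expert near $b \approx 1$:

$$
\underbrace{s\,N}_{\substack{\text{MoE weight bytes}\\\text{per token}}}
\;>\;
\underbrace{\frac{q\,N}{b}}_{\substack{\text{dense weight bytes}\\\text{per token}}}
\quad\Longleftrightarrow\quad
q \;<\; s\,b .
$$

Under this reading, MoE's training-time advantage ($sN < qN$, i.e., $s < q$) can invert at inference once effective dense reuse $b$ exceeds $q/s$.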