Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, the massive resident expert pool crowds out high-bandwidth memory (HBM) headroom for the KV cache. We formalize this phenomenon as reuse fragmentation; it pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion for when MoE is structurally disadvantaged relative to a quality-matched dense model. The criterion combines sparsity ($s$), the fraction of parameters activated per token, with the quality-equivalence factor ($q$), the size multiplier a dense model requires to match MoE performance. Our evaluation across frontier models, including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C, demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, it yields a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable. Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
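As a first-order sketch of why a $qs$-type criterion arises (illustrative only; it ignores attention and KV-cache traffic, and the precise form depends on the serving configuration): let $N$ be the MoE model's total parameter count, so the quality-matched dense model has $qN$ parameters and the MoE activates $sN$ per token. In bandwidth-bound decode, the dense model amortizes each weight read over an effective microbatch of $b$ tokens, while fragmented routing leaves each expert near $b \approx 1$:

$$
\underbrace{s\,N}_{\substack{\text{MoE weight bytes}\\\text{per token}}}
\;>\;
\underbrace{\frac{q\,N}{b}}_{\substack{\text{dense weight bytes}\\\text{per token}}}
\quad\Longleftrightarrow\quad
q \;<\; s\,b .
$$

Under this reading, MoE's training-time advantage ($sN < qN$, i.e., $s < q$) can invert at inference once effective dense reuse $b$ exceeds $q/s$.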