Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.
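To make the routing mechanism concrete, below is a minimal sketch of sequence-level selection. It assumes a PyTorch-style router whose per-token softmax scores are compared globally across the sequence; the function name `seqtopk_routing`, the tensor shapes, and the choice not to renormalize the selected gates are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of sequence-level TopK routing (SeqTopK).
# Assumption: per-token softmax scores are ranked jointly over the whole
# sequence; this may differ in detail from the paper's exact formulation.
import torch


def seqtopk_routing(router_logits: torch.Tensor, k: int):
    """Select T*K (token, expert) pairs jointly instead of K experts per token.

    router_logits: [T, E] router scores for T tokens over E experts.
    Returns a boolean [T, E] selection mask and the softmax gating weights
    restricted to the selected entries (not renormalized here).
    """
    T, E = router_logits.shape
    budget = T * k  # same total expert budget as per-token TopK

    scores = torch.softmax(router_logits, dim=-1)  # per-token probabilities
    flat = scores.reshape(-1)                      # [T * E]
    _, top_idx = torch.topk(flat, budget)          # global selection over the sequence

    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[top_idx] = True
    mask = mask.reshape(T, E)

    gates = scores * mask  # zero out unselected experts
    return mask, gates


# Example: 4 tokens, 8 experts, K = 2 -> 8 expert slots shared across the sequence.
logits = torch.randn(4, 8)
mask, gates = seqtopk_routing(logits, k=2)
print(mask.sum(dim=-1))  # per-token expert counts can differ, but sum to T*K
```

In this sketch, difficult tokens can end up with more than K selected experts and easy tokens with fewer, while the total number of activated (token, expert) pairs stays at T*K, matching the fixed compute budget of standard TopK routing.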