Sparse mixture-of-experts (MoE) routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x) and reduce the number of experts needed for 99% coverage from infeasible to a small constant. The modifications are: temporal memory (beta), a per-expert leaky integrate-and-fire (LIF) membrane potential that accumulates routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, which yields a 31x contrast between reliable and unreliable experts; and anticipatory routing (Ant), a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw on Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives no gain (+0.000 +/- 0.001); beta alone gives a modest gain (+0.295 +/- 0.013); combined, they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum of the individual gains by +0.446 +/- 0.014). This interaction is structural: a stateless predictor cannot detect an approaching transition because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE language model (5 seeds), beta-routing reduces transition-step bits per character (BPC) from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in the input, vs 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
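To make the first two gate modifications concrete, the following is a minimal sketch and not the reference implementation. It assumes a leaky-integrator update for the beta memory, an exponential moving average of squared error as the variance estimate behind Pi, and an additive log-precision term in the gate logits. The class name `LeakyPrecisionGate`, its methods, and the default constants are hypothetical; the linked repository may define all of these differently. Anticipatory routing, which would be a learned predictor reading the accumulated state, is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LeakyPrecisionGate:
    """Hypothetical stateful gate: per-expert leaky memory plus precision weighting."""

    def __init__(self, n_experts, beta=0.9, eps=1e-6):
        self.beta = beta                      # leak factor (assumed form of the beta memory)
        self.eps = eps                        # avoids division by zero in the precision
        self.membrane = [0.0] * n_experts     # accumulated routing context per expert
        self.err_sq = [1.0] * n_experts       # running mean squared prediction error

    def update_errors(self, errors, rate=0.1):
        """Track recent squared error per expert (a simple variance proxy)."""
        self.err_sq = [
            (1.0 - rate) * s + rate * e * e
            for s, e in zip(self.err_sq, errors)
        ]

    def route(self, affinities):
        """Return routing probabilities for one token's per-expert affinities."""
        # Leaky integration: old context decays, new affinity is mixed in.
        self.membrane = [
            self.beta * v + (1.0 - self.beta) * a
            for v, a in zip(self.membrane, affinities)
        ]
        # Precision = inverse of the recent error variance estimate.
        precision = [1.0 / (s + self.eps) for s in self.err_sq]
        # Assumed combination: add log-precision to the accumulated logits.
        logits = [v + math.log(p) for v, p in zip(self.membrane, precision)]
        return softmax(logits)


def stateless_route(affinities):
    """Standard affinity routing: softmax over the current token only."""
    return softmax(affinities)
```

The sketch only illustrates that such a gate's output depends on token history and on each expert's recent reliability, whereas `stateless_route` sees the current token alone. It makes no attempt to reproduce the reported numbers.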