Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 ± 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 ± 0.002 (124x) and reduce the number of experts needed for 99% coverage from infeasible to a small constant. The modifications are: temporal memory (beta), a per-expert LIF membrane potential that accumulates routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, which yields a 31x contrast between reliable and unreliable experts; and anticipatory routing (Ant), a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw on Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta × Ant interaction: anticipation alone gives no gain (+0.000 ± 0.001), beta alone gives a modest gain (+0.295 ± 0.013), and the two combined close 75% of the oracle gap (+0.741 ± 0.002, exceeding the sum of the individual gains by +0.446 ± 0.014). The interaction is structural: a stateless predictor cannot detect an approaching transition, because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 ± 0.01 (Standard MoE) to 4.01 ± 0.15 (beta-MoE), and the beta + Ant gate places 0.86 ± 0.02 probability on the correct domain expert before that domain appears in the input, versus 0.42 ± 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
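To make the three gate modifications concrete, the following is a minimal sketch under stated assumptions, not the reference implementation from the repository. The class name `BetaPiGate`, the parameters `beta`, `err_decay`, and `eps`, the use of an exponential moving average of squared error as the variance estimate, the additive combination of membrane and log-precision, and the `predictor` callable are all hypothetical choices made for illustration. The membrane here is a leaky integrator only; the firing and reset steps of a full LIF neuron are omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class BetaPiGate:
    """Illustrative gate (hypothetical, not the reference implementation)."""

    def __init__(self, n_experts, beta=0.9, err_decay=0.9, eps=1e-6):
        self.beta = beta            # assumed leak factor of the membrane
        self.err_decay = err_decay  # assumed decay of the error statistic
        self.eps = eps              # guards against division by zero
        self.membrane = [0.0] * n_experts  # temporal memory, one per expert
        self.err_var = [1.0] * n_experts   # running squared prediction error

    def precision(self):
        # Pi: inverse of the running error-variance estimate, per expert.
        return [1.0 / (v + self.eps) for v in self.err_var]

    def observe_error(self, errors):
        # Assumption: variance is approximated by an EMA of squared error.
        d = self.err_decay
        self.err_var = [d * v + (1.0 - d) * e * e
                        for v, e in zip(self.err_var, errors)]

    def route(self, affinity, predictor=None):
        # beta: leaky integration of per-expert affinity across tokens.
        b = self.beta
        self.membrane = [b * m + (1.0 - b) * a
                         for m, a in zip(self.membrane, affinity)]
        # Assumed combination: membrane plus log-precision as gate logits.
        logits = [m + math.log(p)
                  for m, p in zip(self.membrane, self.precision())]
        if predictor is not None:
            # Ant: hypothetical next-state predictor that reads the membrane
            # and returns one additive logit per expert.
            bonus = predictor(self.membrane)
            logits = [l + x for l, x in zip(logits, bonus)]
        return softmax(logits)


gate = BetaPiGate(n_experts=2)
for _ in range(5):
    gate.route([2.0, 0.0])        # five tokens whose affinity favors expert 0
probs = gate.route([0.0, 0.0])    # an uninformative token
stateless = softmax([0.0, 0.0])   # affinity-only baseline on the same token
```

On the uninformative token, the affinity-only baseline is uniform, while the membrane still carries a preference for expert 0 from the preceding tokens. In this sketch the predictor only sees the accumulated membrane, which mirrors the abstract's point that anticipation needs state to act on.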
