HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two architectures trained from scratch, a Jamba-style hybrid and a 12-layer Transformer; retrofitting it into pretrained models was also tested and is reported as a negative result. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens (encode), tokens project against the hubs to produce routing fingerprints (decode), a score head selects the top-k tokens (score), and a sparse council attends only to the selected subset (council). We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% perplexity (PPL) improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x higher training throughput at sequence length 1024 against matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 for the pure Transformer). (3) Hub-GPT provides strictly causal routing and reaches PPL 211.5 +/- 0.4 over 3 seeds (after the council-causal fix). This is approximately 3 PPL worse than Jamba's 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. After the fix, chunk size C has little effect; the chunk-size benefit observed before the fix was an artifact of a bidirectional-council leak that we found in adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) identifies M=8-14 as the reliably converging sub-band (4-5 of 5 seeds); M=6 is rescued to 5 of 5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. The companion paper arXiv:2603.20997 (Basu, 2026) defines the routing diagnostic task. Code and scripts will be released.
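
The four stages described above can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: the function names (`attend`, `hub_route`), the identity projections, the linear score head `score_w`, the residual write-back, and the toy sizes are all assumptions made for illustration, and the causal masking and chunking used by Hub-GPT are omitted. The point is only to show where the O(nM) cost comes from: the encode and decode stages each compute n*M scores, and the council computes k*k scores over the selected subset.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Scaled dot-product attention with identity projections (assumption)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        w = softmax([dot(q, key) * scale for key in keys])
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def hub_route(tokens, hubs, score_w, k):
    """Hypothetical encode-decode-score-council pass (non-causal sketch)."""
    # Encode: M hubs cross-attend to all n tokens (n*M scores).
    hub_states = attend(hubs, tokens, tokens)
    # Decode: each token is compared with the M hub states, giving a
    # length-M routing fingerprint per token (another n*M scores).
    fingerprints = [softmax([dot(t, h) for h in hub_states]) for t in tokens]
    # Score: a linear head over the fingerprint ranks tokens; keep the top k.
    scores = [dot(f, score_w) for f in fingerprints]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:k])
    # Council: the selected tokens attend only among themselves (k*k scores).
    subset = [tokens[i] for i in selected]
    council = attend(subset, subset, subset)
    # Residual write-back to the selected positions; others pass through.
    out = [list(t) for t in tokens]
    for i, update in zip(selected, council):
        out[i] = [a + b for a, b in zip(out[i], update)]
    return out, selected

# Toy sizes, chosen for illustration only.
random.seed(0)
n, d, M, k = 16, 4, 3, 5
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
hubs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(M)]
score_w = [random.gauss(0, 1) for _ in range(M)]

out, selected = hub_route(tokens, hubs, score_w, k)
routing_scores = 2 * n * M + k * k   # encode + decode + council
full_attention_scores = n * n        # dense self-attention, for comparison
```

In this sketch the number of attention scores grows as 2nM + k^2 rather than n^2, so with M and k held fixed the cost is linear in sequence length. A trained module would replace the identity projections with learned ones and, for the strictly causal variant, restrict each stage to past positions.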
