Continual learning in transformers is commonly addressed through parameter-efficient adaptation: prompts, adapters, or LoRA modules are specialized per task while the backbone remains frozen. Although effective in controlled multi-epoch settings, these approaches rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL), where data arrive as a non-stationary stream and each sample may be observed only once. We recast continual learning in transformers as a routing problem: under strict online constraints, the model must dynamically select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. We thus introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing or merging task-specific prompts, RwF generates dynamic prompts through single-step associative retrieval over the transformer token embeddings at each layer. Retrieval corresponds to the closed-form minimization of a strictly convex free-energy functional, enabling input-conditioned routing within each forward pass, without relying on iterative gradient refinement. Across challenging class-incremental benchmarks, RwF outperforms prior prompt-based approaches by a large margin on Split-ImageNet-R and Split-ImageNet-S, even in few-shot learning regimes. These results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for OCL.
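The single-step retrieval described above can be illustrated with the standard Modern Hopfield update rule (Ramsauer et al.), in which each query is replaced by a softmax-weighted combination of stored patterns; this update is the closed-form minimizer step of the associated strictly convex free energy. The following NumPy sketch is illustrative only: the memory bank, shapes, and inverse temperature `beta` are assumptions, not the RwF implementation.

```python
import numpy as np

def hopfield_retrieve(queries, memories, beta=1.0):
    """Single-step Modern Hopfield retrieval.

    queries:  (n, d) array of token embeddings acting as probes.
    memories: (m, d) array of stored patterns (e.g., a prompt bank).
    Returns an (n, d) array: each query mapped to a convex
    combination of memories via a softmax over similarities.
    """
    scores = beta * queries @ memories.T           # (n, m) similarity logits
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over memories
    return weights @ memories                      # retrieved patterns

# Usage sketch: token embeddings query a small memory bank to
# produce input-conditioned prompts in one forward step.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))    # 4 token embeddings, dim 8
memory = rng.normal(size=(16, 8))   # 16 stored patterns
prompts = hopfield_retrieve(tokens, memory, beta=4.0)
```

At large `beta` the softmax sharpens and retrieval approaches nearest-pattern lookup; at small `beta` it blends patterns, which is one way a single forward pass can route an input toward a task-appropriate subspace.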