Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: \emph{attention demand should decrease over time as recurring patterns become familiar}. We present a surprising finding from analyzing GPT-2 models: \textbf{88\%} of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does \emph{not} decrease during training. Motivated by this observation, we introduce \textbf{\ours{}} (\textbf{C}onsolidation-based \textbf{R}outing for \textbf{A}daptive \textbf{M}emory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, \ours{} exhibits \emph{decreasing attention utilization} over training, achieving a \textbf{37.8$\times$} reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is \emph{impossible} without consolidation: any static routing scheme requires $\Omega(f \cdot n)$ attention for tasks with recurring patterns of frequency $f$. On our proposed SRCD benchmark, \ours{} achieves \textbf{100\% retrieval accuracy} at 1.6\% attention compute (vs.\ 68\% for baselines), and consolidated patterns transfer to unseen tasks with \textbf{48--52\%} attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology ($\gamma = 0.43$ vs.\ $\gamma_{\text{human}} \approx 0.4$--$0.5$). Code and benchmarks are available at [anonymized].