Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space "thinking" chains of thought. A growing line of work pushes extra computation into the model's latent space, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent rollouts, (ii) residual/activation steering, and (iii) memory (KV) compression. An underexplored alternative is memory consolidation/reconsolidation, two processes in the brain that, respectively, stabilise newly formed memory traces and, upon recall, transiently render established traces plastic so that they can integrate new contextual information before restabilising. In Transformer LLMs, this can be seen as analogous to performing in-place rewrites of new KV segments (consolidation) and rewrites of recalled past segments (reconsolidation). In this work, we give a theoretical justification for why memory (re)consolidation via KV cache rewrites benefits reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between compression of input information and retention of predictive information in latent representations. We then introduce the Bottlenecked Transformer, which augments a backbone LLM with a Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries. The Processor consolidates recently written KV entries and reconsolidates a small, top-k attention-selected set of prior entries. We evaluate our Bottlenecked Transformer architecture on math reasoning benchmarks. Our model shows consistent performance gains over vanilla Transformers and pause-token-augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
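The rewrite schedule described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify how attention scores are aggregated, how large k is, or how the Cache Processor is built. All names below (`is_step_boundary`, `select_rewrite_indices`, `rewrite_in_place`, `mean_mixing_processor`) are hypothetical, cache entries are plain lists of floats rather than per-layer key/value tensors, and the "processor" is a trivial averaging function standing in for the auxiliary Transformer.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def is_step_boundary(token_text: str) -> bool:
    """Assumed trigger: a reasoning step ends when a token contains a newline."""
    return "\n" in token_text


def select_rewrite_indices(
    prior_scores: Sequence[float], num_recent: int, k: int
) -> List[int]:
    """Pick the cache positions to rewrite at a step boundary.

    prior_scores[i] is an assumed attention score for prior entry i (how the
    score is aggregated is not stated in the abstract). The last `num_recent`
    cache positions are the newly written step and are always included.
    """
    num_prior = len(prior_scores)
    # Top-k prior entries by score: candidates for reconsolidation.
    top_prior = sorted(
        range(num_prior), key=lambda i: prior_scores[i], reverse=True
    )[:k]
    # All entries written during the latest step: candidates for consolidation.
    recent = list(range(num_prior, num_prior + num_recent))
    return sorted(top_prior) + recent


def rewrite_in_place(
    cache: List[Vector],
    indices: Sequence[int],
    processor: Callable[[List[Vector]], List[Vector]],
) -> None:
    """Run the processor over the selected entries and write results back."""
    selected = [cache[i] for i in indices]
    updated = processor(selected)
    if len(updated) != len(selected):
        raise ValueError("processor must return one vector per input entry")
    for i, vec in zip(indices, updated):
        cache[i] = vec  # in-place: cache length and positions are unchanged


def mean_mixing_processor(entries: List[Vector], alpha: float = 0.5) -> List[Vector]:
    """Toy stand-in for the Cache Processor (NOT the paper's model).

    Each entry is mixed with the mean of all selected entries, so every
    output depends on earlier and later positions alike (non-causal).
    """
    dim = len(entries[0])
    mean = [sum(e[d] for e in entries) / len(entries) for d in range(dim)]
    return [
        [(1 - alpha) * e[d] + alpha * mean[d] for d in range(dim)]
        for e in entries
    ]


if __name__ == "__main__":
    cache = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]  # 4 prior + 2 recent
    scores = [0.1, 0.5, 0.05, 0.35]                      # one score per prior entry
    if is_step_boundary("x = 3\n"):
        idx = select_rewrite_indices(scores, num_recent=2, k=2)
        rewrite_in_place(cache, idx, mean_mixing_processor)
    print(cache)
```

In this sketch the cache keeps its length after every rewrite, which is what distinguishes the in-place (re)consolidation described here from the KV compression methods listed under (iii); entries that are neither recent nor among the top-k are left untouched.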
