Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To address this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. At inference time, this shifts the extra computation into the sleep phase while preserving the latency of wake-time prediction. We evaluate our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which both a regular Transformer and SSM-attention hybrid models fail. We then show that increasing the sleep duration $N$ improves the performance of our models, with the largest gains on examples that require deeper reasoning.
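To make the sleep phase concrete, the following is a minimal toy sketch and not the method studied here. It assumes a single linear associative memory in place of an SSM block, and it substitutes a fixed, normalized delta rule for the learned local rule. The names `sleep_consolidate`, `recall_error`, `fast_w`, `kv_cache`, `n_passes`, and `lr` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sleep_consolidate(fast_w, kv_cache, n_passes, lr=0.5):
    """Replay the cached (key, value) pairs n_passes times, then clear the cache.

    Assumption: a normalized delta rule stands in for the learned local rule.
    """
    for _ in range(n_passes):              # N offline passes over the context
        for k, v in kv_cache:
            pred = fast_w @ k              # what the fast weights currently recall
            # Local update: uses only this pair's key and its prediction error.
            fast_w = fast_w + lr * np.outer(v - pred, k) / (k @ k)
    kv_cache.clear()                       # context now lives only in fast_w
    return fast_w


def recall_error(fast_w, pairs):
    """Mean Euclidean error when recalling each value from its key."""
    return float(np.mean([np.linalg.norm(fast_w @ k - v) for k, v in pairs]))


rng = np.random.default_rng(0)
d = 16
pairs = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(8)]

errors = {}
for n in (1, 4, 16):
    cache = list(pairs)                    # fresh "wake-time" cache for each run
    w = sleep_consolidate(np.zeros((d, d)), cache, n_passes=n)
    errors[n] = recall_error(w, pairs)
    cache_cleared = len(cache) == 0

print(errors)
```

In this toy setting, more passes let the fast weights fit the cached associations more closely. Wake-time recall is a single matrix-vector product whatever the value of `n_passes`, which mirrors the idea of moving the extra computation into sleep.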