Large language models achieve strong reasoning performance by scaling inference-time compute, yet remain fundamentally stateless, discarding the rich, self-produced reasoning traces generated during this process. We investigate whether models can instead learn online from this experience, converting transient computation (reasoning traces) into persistent, reusable knowledge without external supervision or access to future data. We show that In-Context Learning (ICL) over raw reasoning traces fails to generalize, reflecting a fundamental limitation of token-level reuse: individual traces lack the abstraction needed for transfer, even after refinement (e.g., self-reflection). In contrast, drawing inspiration from recent work on unsupervised reinforcement learning, we find that lightweight per-instance training with self-generated test-time signals (majority voting) as rewards yields substantial gains, often surpassing full-dataset offline training. This motivates a shift from raw traces to learned latent representations. Building on this insight, we propose an online method that distills the inference-time compute spent on encountered problems into compact, modular latent memories that capture the underlying reasoning structure. These memories are stored and retrieved for future inputs, enabling continual improvement, while their modular design avoids catastrophic forgetting. Importantly, our method is highly efficient: the memories are parametrized as extremely lightweight soft prompts (~0.001% of model parameters) and trained with only a few gradient steps, yet achieve performance competitive with full parametric updates and offline training. Across challenging mathematical reasoning benchmarks, our approach significantly outperforms zero-shot and raw-data ICL baselines, while transferring effectively across datasets.
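The majority-voting reward mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote_rewards`, the binary 0/1 reward, and the use of exact string matching on final answers are all assumptions made for illustration. The idea is that the most frequent answer among several sampled reasoning traces serves as a pseudo-label, and each trace is rewarded according to whether it agrees with that label.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score sampled answers against their own majority vote.

    `answers` holds the final answer extracted from each sampled
    reasoning trace for one problem (hypothetical input format).
    Returns the pseudo-label and a 0/1 reward per answer.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # The most frequent answer acts as the pseudo-label; no ground truth is used.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    # Assumed binary reward: 1.0 for agreeing with the majority, else 0.0.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    label, rewards = majority_vote_rewards(["42", "41", "42", "42", "7"])
    print(label, rewards)
```

In a setup like the one described, such rewards would drive a few gradient steps on the per-instance memory parameters; how the rewards enter the training objective is not specified in the abstract and is left out of this sketch.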