Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, CORE achieves the strongest performance in most task-data regimes. Finally, we highlight how CORE is substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
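To make the contrastive step concrete, the following is a minimal sketch of how successful and unsuccessful traces might be paired and distilled into insights. It is an illustration under stated assumptions, not the implementation described here: the `Trace` record, the same-problem pairing rule, the reward threshold, the prompt wording, and the `summarize` callable (a stand-in for a language-model call) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    """One past attempt at a problem (hypothetical record layout)."""
    problem_id: str
    reasoning: str
    reward: float  # verifiable reward, e.g. 1.0 when the final answer checks out


def contrastive_pairs(
    traces: List[Trace], threshold: float = 0.5
) -> List[Tuple[Trace, Trace]]:
    """Pair every successful trace with every unsuccessful trace on the same problem.

    Pairing within a problem is an assumption made for this sketch.
    """
    by_problem: Dict[str, List[Trace]] = defaultdict(list)
    for trace in traces:
        by_problem[trace.problem_id].append(trace)

    pairs: List[Tuple[Trace, Trace]] = []
    for group in by_problem.values():
        wins = [t for t in group if t.reward >= threshold]
        losses = [t for t in group if t.reward < threshold]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


def reflect(
    pairs: List[Tuple[Trace, Trace]], summarize: Callable[[str], str]
) -> List[str]:
    """Ask `summarize` (a stand-in for a model call) for one short insight per pair."""
    insights: List[str] = []
    for win, loss in pairs:
        prompt = (
            "Compare the two attempts below and state, in one sentence, the "
            "reasoning strategy or constraint that separates them.\n\n"
            f"Successful attempt:\n{win.reasoning}\n\n"
            f"Unsuccessful attempt:\n{loss.reasoning}"
        )
        insight = summarize(prompt).strip()
        # Keep the insight store compact: skip empty and repeated insights.
        if insight and insight not in insights:
            insights.append(insight)
    return insights


def build_prompt(question: str, insights: List[str]) -> str:
    """Prepend stored insights to a new question; no weights are updated."""
    if not insights:
        return question
    bullet_list = "\n".join(f"- {text}" for text in insights)
    return f"Insights from earlier attempts:\n{bullet_list}\n\n{question}"
```

In this sketch the learned knowledge lives entirely in the `insights` list, so the prompt grows by one short sentence per distinct insight rather than by whole stored traces.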