Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic computational cost in sequence length, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme, a supervised cold start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy on AIME24 by 21% and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
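To make the iterative paradigm concrete, the following is a minimal sketch of a think-summarize-resume inference loop of the kind described above. It is an illustration under stated assumptions, not the authors' implementation: the `<answer>`/`<summary>` tags, the prompt template, and the `generate` callable are all hypothetical placeholders for whatever format the trained model actually uses.

```python
import re

def extract_tag(text: str, tag: str):
    """Return the content of <tag>...</tag> if present, else None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

def iterative_reasoning(generate, question: str, max_iters: int = 8):
    """Bounded-context reasoning loop: think, summarize, resume.

    `generate` is any prompt -> text callable (e.g., an LLM endpoint).
    Each iteration conditions only on the question plus the latest
    summary, so per-iteration context stays short no matter how long
    the total reasoning runs.
    """
    summary = ""
    for _ in range(max_iters):
        prompt = (
            f"Question: {question}\n"
            f"Summary of prior reasoning: {summary}\n"
            "Continue reasoning. End with <answer>...</answer> if done, "
            "otherwise end with <summary>...</summary>."
        )
        segment = generate(prompt)

        answer = extract_tag(segment, "answer")
        if answer is not None:
            return answer  # the model chose to terminate the iteration

        # Otherwise carry forward only the model's own summary; the full
        # reasoning segment is discarded, which is what bounds context.
        summary = extract_tag(segment, "summary") or summary
    return None  # iteration budget exhausted without a final answer
```

In this framing, supervised cold start would teach the model to emit well-formed segments of this shape, while trajectory-level reinforcement learning would optimize the stopping and summarization decisions across the whole loop rather than within a single segment.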