Large Reasoning Models (LRMs) excel at multi-step reasoning but often suffer from inefficient reasoning processes such as overthinking and overshooting, where excessive or misdirected reasoning increases computational cost and degrades performance. Existing efficient reasoning methods operate in a closed-loop manner, lacking mechanisms for external intervention to guide the reasoning process. To address this, we propose Think-with-Me, a novel test-time interactive reasoning paradigm that introduces external feedback into the reasoning process. Our key insights are twofold: transitional conjunctions serve as natural intervention points, signaling phases of self-validation or exploration; and appropriately using transitional words to prolong reasoning improves performance, whereas excessive use degrades it. Building on these insights, Think-with-Me pauses reasoning at these points to solicit external feedback and adaptively extends or terminates reasoning, reducing redundancy while preserving accuracy. The feedback is generated via a multi-criteria evaluation of rationality and completeness and can come from either humans or LLM proxies. We train the target model with Group Relative Policy Optimization (GRPO) to adapt it to this interactive mode. Experiments show that Think-with-Me achieves a superior balance between accuracy and reasoning length under limited context windows: on AIME24 with an 8K window, it outperforms QwQ-32B by 7.19% in accuracy while reducing average reasoning length by 81%. The paradigm also generalizes to security and creative tasks.
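To make the interaction loop concrete, the following is a minimal Python sketch of the pause-feedback-continue cycle the abstract describes. All identifiers here (`generate_until`, `proxy.evaluate`, the conjunction list, and the continuation prompts) are illustrative assumptions, not the paper's actual API; the sketch only shows the control flow: generation pauses at transitional conjunctions, an external proxy scores the trace on rationality and completeness, and reasoning is then either extended or terminated.

```python
# Hypothetical sketch of the Think-with-Me test-time interaction loop.
# Assumed interfaces: model.generate_until(text, stop_words) -> (segment, bool),
# proxy.evaluate(text, criteria) -> dict of booleans, model.answer(text) -> str.

TRANSITIONAL_CONJUNCTIONS = ("wait", "alternatively", "however", "but")  # assumed list

def think_with_me(model, proxy, prompt, max_rounds=8):
    """Pause at transitional conjunctions and ask an external proxy
    (human or LLM) whether to extend or terminate reasoning."""
    trace = prompt
    for _ in range(max_rounds):
        # Generate until the next transitional conjunction or end of thought.
        segment, paused = model.generate_until(
            trace, stop_words=TRANSITIONAL_CONJUNCTIONS
        )
        trace += segment
        if not paused:
            break  # model finished reasoning on its own
        # Multi-criteria evaluation: rationality and completeness.
        feedback = proxy.evaluate(trace, criteria=("rationality", "completeness"))
        if feedback["rationality"] and feedback["completeness"]:
            # Terminate early: the reasoning so far is judged sufficient.
            trace += "\nThe reasoning is sufficient; give the final answer."
            break
        # Otherwise extend: let the model keep exploring or self-validating.
        trace += "\nContinue reasoning from here."
    return model.answer(trace)
```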