Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way for efficient and elaborate methods that refine the reasoning processes of future AI.
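To make the recursive cycle concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the reward formulas, the stopping rule, or any interface, so everything here is an assumption made for illustration: the names `recursive_think_answer`, `confidence_increase_reward`, and `final_answer_confidence_reward`, the `max_rounds` and `threshold` parameters, the early stop on a confidence threshold, and the specific form of both rewards (fraction of rounds in which confidence rose, and the confidence of the last answer). The `generate` and `confidence` callables stand in for the model and the confidence generator.

```python
from typing import Callable, List, Tuple


def recursive_think_answer(
    question: str,
    generate: Callable[[str, List[str]], str],
    confidence: Callable[[str, str], float],
    max_rounds: int = 3,
    threshold: float = 0.9,
) -> Tuple[List[str], List[float]]:
    """Run think-answer rounds until confidence is high enough (assumed rule)."""
    answers: List[str] = []
    confidences: List[float] = []
    for _ in range(max_rounds):
        # Each round sees the earlier answers so it can revise them.
        answer = generate(question, answers)
        answers.append(answer)
        confidences.append(confidence(question, answer))
        # Hypothetical stopping rule: stop once the score reaches the threshold.
        if confidences[-1] >= threshold:
            break
    return answers, confidences


def confidence_increase_reward(confidences: List[float]) -> float:
    """Assumed form: fraction of consecutive rounds where confidence strictly rose."""
    if len(confidences) < 2:
        return 0.0
    rises = sum(1 for prev, cur in zip(confidences, confidences[1:]) if cur > prev)
    return rises / (len(confidences) - 1)


def final_answer_confidence_reward(confidences: List[float]) -> float:
    """Assumed form: the confidence assigned to the last answer."""
    return confidences[-1] if confidences else 0.0


# Toy stand-ins for the model and the confidence generator (illustrative only).
def toy_generate(question: str, previous: List[str]) -> str:
    return f"draft-{len(previous) + 1}"


def toy_confidence(question: str, answer: str) -> float:
    return {"draft-1": 0.4, "draft-2": 0.7, "draft-3": 0.95}[answer]


answers, confs = recursive_think_answer("2 + 2 = ?", toy_generate, toy_confidence)
increase_reward = confidence_increase_reward(confs)
final_reward = final_answer_confidence_reward(confs)
print(answers, confs, increase_reward, final_reward)
```

In this toy run the confidence rises in every round, so the assumed increase reward is at its maximum and the final reward equals the last confidence score. How the two rewards are actually defined and combined during training is not stated in the abstract.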