Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a \emph{blind self-thinking} paradigm: they perform extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks, which primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70\% higher accuracy, a 22.90\% higher pass rate, and a 41.36-point BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at \url{https://github.com/SUAT-AIRI/Proactive-Interactive-R1}.
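To make the idea of interleaving reasoning with clarification concrete, the following is a minimal toy sketch, not the PIR implementation. Every name in it (`Step`, `interactive_solve`, `toy_policy`, `toy_user`, `toy_reward`) is hypothetical, and the reward shown (correctness minus a per-question penalty) is only an assumed stand-in for the composite reward, whose actual form is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One model move: a clarifying question ("ask") or a final answer ("answer")."""
    kind: str
    text: str


def interactive_solve(
    task: str,
    policy: Callable[[str, List[str]], Step],
    user: Callable[[str], str],
    max_turns: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Alternate between model moves and user replies until the model answers.

    Returns (answer, transcript); answer is None if the turn budget runs out.
    """
    transcript: List[str] = []
    for _ in range(max_turns):
        step = policy(task, transcript)
        if step.kind == "answer":
            return step.text, transcript
        # The policy judged a premise to be missing, so it asks the user.
        transcript.append(f"Q: {step.text}")
        transcript.append(f"A: {user(step.text)}")
    return None, transcript


def toy_policy(task: str, transcript: List[str]) -> Step:
    """Hypothetical policy: ask once for the missing quantity, then answer."""
    replies = [line for line in transcript if line.startswith("A: ")]
    if not replies:
        return Step("ask", "How many apples?")
    count = int(replies[-1][len("A: "):])
    return Step("answer", str(2 * count))


def toy_user(question: str) -> str:
    """Hypothetical user simulator that always knows the missing premise."""
    return "5"


def toy_reward(answer: Optional[str], reference: str,
               num_questions: int, turn_penalty: float = 0.1) -> float:
    """Assumed reward shape: 1 for a correct answer, minus a cost per question."""
    correct = 1.0 if answer == reference else 0.0
    return correct - turn_penalty * num_questions


answer, transcript = interactive_solve(
    "What is the total price of some apples at 2 each?", toy_policy, toy_user
)
num_questions = sum(line.startswith("Q: ") for line in transcript)
reward = toy_reward(answer, "10", num_questions)
```

In this toy run the policy asks one question before answering, so the assumed penalty term lowers the reward slightly; a penalty of this kind is one plausible way to discourage unnecessary interaction turns.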
