Large language models have achieved remarkable progress on complex reasoning tasks. However, when inputs are incomplete they often implicitly fabricate information and produce confident but unreliable conclusions, a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability but from a lack of inferential boundary awareness: the ability to recognize when the premises necessary for valid inference are missing. To address it, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which determines whether the available information is sufficient, and grounded reasoning, which solves the task once the necessary premises are established. We design stage-specific rewards that penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (by up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
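To make the two-stage reward idea concrete, the following is a minimal sketch of what a stage-specific reward could look like. It is not the reward used by GRIL: the `Turn` type, the `stage_reward` function, and every numeric value are hypothetical and chosen only for illustration. The sketch assumes the environment knows whether the premises are complete at each turn.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One model turn in a hypothetical multi-turn episode."""
    asked_for_clarification: bool   # model paused and asked for a missing premise
    answer: Optional[str] = None    # final answer, if the model produced one


def stage_reward(turn: Turn, premises_complete: bool, gold_answer: str) -> float:
    """Toy stage-specific reward; all numeric values are illustrative assumptions."""
    if not premises_complete:
        # Stage 1 (clarify and pause): reward pausing to ask, and penalize
        # anything else, such as answering on fabricated premises.
        return 1.0 if turn.asked_for_clarification else -1.0
    # Stage 2 (grounded reasoning): the necessary premises are established.
    if turn.asked_for_clarification:
        return -0.5  # unnecessary question when nothing is missing
    return 1.0 if turn.answer == gold_answer else 0.0


# Example episode: the model first asks for the missing premise, then answers.
episode = [
    (Turn(asked_for_clarification=True), False),
    (Turn(asked_for_clarification=False, answer="18"), True),
]
total = sum(stage_reward(turn, complete, "18") for turn, complete in episode)
```

Under these assumed values, a model that answers immediately on an incomplete problem is penalized even if its guess happens to match the gold answer, which is the behavior the abstract describes as penalizing hallucinations.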