Enhancing the reasoning capabilities of Large Language Models (LLMs) is a key strategy for building agents that "think then act." However, recent observations, such as those on OpenAI's o3, suggest a paradox: stronger reasoning often coincides with increased hallucination. No prior work has systematically examined whether reasoning enhancement itself causes tool hallucination. To address this gap, we pose the central question: does strengthening reasoning increase tool hallucination? To answer it, we introduce SimpleToolHalluBench, a diagnostic benchmark that measures tool hallucination in two failure modes: (i) no tool is available, and (ii) only distractor tools are available. Through controlled experiments, we establish three key findings. First, we demonstrate a causal relationship: progressively enhancing reasoning through reinforcement learning (RL) increases tool hallucination in proportion to task performance gains. Second, the effect goes beyond overfitting: training on non-tool tasks (e.g., mathematics) still amplifies subsequent tool hallucination. Third, the effect is method-agnostic: it appears both when reasoning is instilled via supervised fine-tuning and when it is merely elicited at inference time by switching from direct answers to step-by-step thinking. We also evaluate mitigation strategies, including prompt engineering and Direct Preference Optimization (DPO), and find a fundamental reliability-capability trade-off: reducing hallucination consistently degrades utility. Mechanistically, reasoning RL disproportionately collapses tool-reliability-related representations, and hallucinations surface as amplified divergences concentrated in late-layer residual streams. These findings indicate that current reasoning enhancement methods inherently amplify tool hallucination, highlighting the need for new training objectives that jointly optimize for capability and reliability.
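As an illustration of the two failure modes the benchmark targets, the following is a minimal sketch of how a tool hallucination rate could be computed. It is not the SimpleToolHalluBench implementation: the names `Case`, `hallucination_rate`, and the two toy models are hypothetical, and the scoring rule (any tool call counts as a hallucination, because no offered tool can solve the query) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Case:
    """One evaluation case (hypothetical structure, not the paper's format)."""
    query: str
    tools: List[str]  # tools exposed to the model; by construction none can solve the query


# A "model" here is any callable that returns the name of the tool it tried
# to call, or None if it declined to call a tool.
Model = Callable[[str, List[str]], Optional[str]]


def hallucination_rate(model: Model, cases: List[Case]) -> float:
    """Fraction of cases in which the model calls a tool.

    Assumption: in both settings (no tool, distractor tools only) the correct
    behaviour is to abstain, so every tool call is counted as a hallucination.
    """
    if not cases:
        return 0.0
    hallucinated = sum(1 for c in cases if model(c.query, c.tools) is not None)
    return hallucinated / len(cases)


# Failure mode (i): no tool is available.
no_tool_cases = [Case("What is the weather in Paris right now?", [])]

# Failure mode (ii): only distractor tools are available.
distractor_cases = [
    Case("What is the weather in Paris right now?", ["calculator", "translate_text"])
]


def grabs_first_tool(query: str, tools: List[str]) -> Optional[str]:
    """Toy model: calls the first offered tool, abstains if none is offered."""
    return tools[0] if tools else None


def invents_tool(query: str, tools: List[str]) -> Optional[str]:
    """Toy model: always calls a tool, even one that was never offered."""
    return "get_weather"


if __name__ == "__main__":
    for name, model in [("grabs_first_tool", grabs_first_tool),
                        ("invents_tool", invents_tool)]:
        print(name,
              "no-tool:", hallucination_rate(model, no_tool_cases),
              "distractor:", hallucination_rate(model, distractor_cases))
```

Reporting the two settings separately keeps the failure modes distinguishable: a model that only misuses offered tools scores differently from one that fabricates tools outright.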