The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination

Enhancing the reasoning capabilities of Large Language Models (LLMs) is a key strategy for building agents that "think then act." However, recent observations, such as those reported for OpenAI's o3, suggest a paradox: stronger reasoning often coincides with increased hallucination. No prior work has systematically examined whether reasoning enhancement itself causes tool hallucination. To address this gap, we pose the central question: does strengthening reasoning increase tool hallucination? To answer it, we introduce SimpleToolHalluBench, a diagnostic benchmark that measures tool hallucination in two failure modes: (i) no tool available, and (ii) only distractor tools available. Through controlled experiments, we establish three key findings. First, we demonstrate a causal relationship: progressively enhancing reasoning through RL increases tool hallucination in proportion to task performance gains. Second, the effect is not an artifact of overfitting: training on non-tool tasks (e.g., mathematics) still amplifies subsequent tool hallucination. Third, the effect is method-agnostic, appearing both when reasoning is instilled via supervised fine-tuning and when it is merely elicited at inference by switching from direct answers to step-by-step thinking. We also evaluate mitigation strategies, including prompt engineering and Direct Preference Optimization (DPO), and find a fundamental reliability-capability trade-off: reducing hallucination consistently degrades utility. Mechanistically, reasoning RL disproportionately collapses tool-reliability-related representations, and hallucinations surface as amplified divergences concentrated in late-layer residual streams. These findings indicate that current reasoning enhancement methods inherently amplify tool hallucination, highlighting the need for new training objectives that jointly optimize for capability and reliability.
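To make the two failure modes concrete, the following is a minimal sketch of how a per-setting hallucination rate could be computed. It is not the benchmark's actual implementation: the `Episode` record, the setting labels `"no_tool"` and `"distractor_only"`, and the rule that any emitted tool call counts as a hallucination in these settings are all assumptions made for illustration, and the toy episodes are invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Episode:
    """One evaluated prompt (hypothetical record, not the benchmark's schema)."""
    setting: str       # assumed labels: "no_tool" or "distractor_only"
    called_tool: bool  # True if the model emitted any tool call


def hallucination_rate(episodes: Iterable[Episode], setting: str) -> float:
    """Fraction of episodes in `setting` where the model called a tool.

    Assumption: in both settings no suitable tool exists, so every tool
    call is counted as a hallucination.
    """
    subset = [e for e in episodes if e.setting == setting]
    if not subset:
        raise ValueError(f"no episodes for setting {setting!r}")
    return sum(e.called_tool for e in subset) / len(subset)


# Invented toy data, for illustration only.
toy_episodes = [
    Episode("no_tool", True),
    Episode("no_tool", False),
    Episode("distractor_only", True),
    Episode("distractor_only", True),
]

for name in ("no_tool", "distractor_only"):
    print(name, hallucination_rate(toy_episodes, name))
```

Reporting the two settings separately, rather than pooling them, keeps a fabricated tool call (no tool available) distinguishable from a misapplied one (only distractors available).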
