With the widespread adoption of large language models (LLMs), hallucinations, non-factual fabrications in model outputs, have become a serious concern. Reasoning capabilities have attracted attention as a self-verification mechanism for improving output reliability. However, the effect of reasoning in a closed setting, where LLMs cannot rely on external tools or knowledge, has yet to be clarified. We therefore conduct experiments under a strict constraint (recommending peer-reviewed journal articles in computer science) to examine the effect of reasoning across multiple models (GPT-5.2 and Gemini 3 Flash). Our results reveal a problematic trade-off between constraint compliance and factual accuracy. Non-reasoning models exhibit high constraint violation rates (66-75%) but maintain factual accuracy, whereas reasoning models reduce violations (13-26%) but systematically distort known facts to satisfy the constraint and produce more complete fabrications. This trade-off is consistent across both models despite their different architectures, indicating a fundamental limitation of reasoning. Furthermore, reasoning does not uniformly improve output authenticity: its effects diverge by model, reflecting different allocations along the compliance-truthfulness trade-off. These findings challenge the assumption that reasoning universally improves reliability: reasoning models trade honest constraint violations for distortions that are harder to detect.