Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.
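The paired-question check described above can be illustrated with a small sketch. This is not the authors' evaluation code: the `PairedAnswers` record, the function names, and the assumption that each answer has already been reduced to a single "Yes" or "No" string are all hypothetical, and the sketch assumes X and Y are never equal in size, so exactly one of the two questions has the answer Yes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedAnswers:
    """Final answers a model gave to one comparison asked in both directions.

    Hypothetical record type for illustration; not taken from the paper.
    """
    x: str
    y: str
    x_bigger_than_y: str  # final answer to "Is X bigger than Y?"
    y_bigger_than_x: str  # final answer to "Is Y bigger than X?"


def is_contradictory(pair: PairedAnswers) -> bool:
    """True when the model answered Yes to both or No to both questions.

    Assumes X and Y differ, so the two correct answers must be opposite.
    """
    a = pair.x_bigger_than_y.strip().lower()
    b = pair.y_bigger_than_x.strip().lower()
    return a in {"yes", "no"} and a == b


def contradiction_rate(pairs: Sequence[PairedAnswers]) -> float:
    """Fraction of question pairs with a contradictory pair of answers."""
    if not pairs:
        return 0.0
    return sum(is_contradictory(p) for p in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up answers, used only to exercise the functions.
    demo = [
        PairedAnswers("X1", "Y1", "Yes", "No"),   # consistent
        PairedAnswers("X2", "Y2", "Yes", "Yes"),  # contradictory
        PairedAnswers("X3", "Y3", "No", "No"),    # contradictory
    ]
    flagged = [(p.x, p.y) for p in demo if is_contradictory(p)]
    print(flagged)
    print(contradiction_rate(demo))
```

A single contradictory pair only shows that at least one of the two answers is wrong. The abstract's claim is about models systematically answering Yes to both or No to both, so a real measurement would presumably need repeated samples per question and some consistency criterion, which this sketch leaves out.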

