Recent studies indicate that when prompts contain explicit biases, models often fail to mention these biases in their Chain-of-Thought (CoT) output, showing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts, without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, and we label this behavior Implicit Post-Hoc Rationalization. We observe this behavior at rates of up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models such as DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.
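To make the reversed-question check concrete, the following is a minimal sketch of how a contradictory-pair rate could be computed. It is an illustration only, not the procedure used in this work: the function name `contradictory_pair_rate`, the one-answer-per-question input format, and the assumption that X and Y are never equal (so exactly one of the two questions has the answer Yes) are all hypothetical simplifications.

```python
from typing import List, Tuple

# Hypothetical input format: one (answer to "Is X bigger than Y?",
# answer to "Is Y bigger than X?") tuple per question pair,
# with each answer normalized to "YES" or "NO".
AnswerPair = Tuple[str, str]


def contradictory_pair_rate(pairs: List[AnswerPair]) -> float:
    """Return the fraction of pairs answered Yes/Yes or No/No.

    Assumes X != Y for every pair, so a consistent model must give
    opposite answers to the question and its reversal.
    """
    if not pairs:
        raise ValueError("need at least one question pair")
    contradictory = sum(1 for forward, reverse in pairs if forward == reverse)
    return contradictory / len(pairs)


example_pairs = [
    ("YES", "NO"),   # consistent
    ("YES", "YES"),  # contradictory: Yes to both orderings
    ("NO", "NO"),    # contradictory: No to both orderings
    ("NO", "YES"),   # consistent
]
rate = contradictory_pair_rate(example_pairs)
```

In this toy input, two of the four pairs are contradictory. A real evaluation would likely sample several responses per question and account for answer noise before labeling a pair as unfaithful; this sketch omits both.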