It has been shown that Large Reasoning Models (LRMs) may not *say what they think*: they do not always volunteer information about how certain parts of the input influence their reasoning. But it is one thing for a model to *omit* such information and another, worse thing to *lie* about it. Here, we extend the work of Chen et al. (2025) to show that LRMs do just this: they flatly deny relying on hints provided in the prompt when answering multiple-choice questions, even when directly asked to reflect on unusual (i.e., hinted) prompt content, even when explicitly allowed to use hints, and even though experiments *show* them to be using the hints. Our results thus have discouraging implications for chain-of-thought (CoT) monitoring and interpretability.