Hint-based faithfulness evaluations have established that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how key parts of the input (e.g. answer hints) influence their reasoning. Yet, these evaluations also fail to specify what models should do when confronted with hints or other unusual prompt content -- even though versions of such instructions are standard security measures (e.g. for countering prompt injections). Here, we study faithfulness under this more realistic setting in which models are explicitly alerted to the possibility of unusual inputs. We find that such instructions can yield strong results on faithfulness metrics from prior work. However, results on new, more granular metrics proposed in this work paint a mixed picture: although models may acknowledge the presence of hints, they will often deny intending to use them -- even when permitted to use hints and even when it can be demonstrated that they are using them. Our results thus raise broader challenges for CoT monitoring and interpretability.