Recent findings suggest that misaligned models may exhibit deceptive behavior, raising concerns about the trustworthiness of their outputs. Chain-of-thought (CoT) is a promising tool for alignment monitoring: when models articulate their reasoning faithfully, monitors can detect and mitigate harmful behaviors before undesirable outcomes occur. However, a key uncertainty remains: can models obfuscate their CoT to pursue hidden adversarial objectives while evading detection? To answer this question, and thereby stress-test CoT monitorability, we develop a composable and quantifiable taxonomy of prompts that elicit CoT obfuscation. We evaluate both internal CoT (reasoning traces) and external CoT (prompted reasoning in outputs) on toy tasks and on more realistic environments in SHADE-Arena. We show that: (i) without obfuscation pressure, CoT monitoring is accurate and efficient; (ii) under strong obfuscation pressure, some models successfully complete adversarial tasks while evading detection; and (iii) under prompt pressure, models obfuscate their internal CoT less than their external CoT. These results suggest that while CoT provides valuable oversight in benign settings, robust deployment requires model-specific stress-testing of monitorability.