Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.
翻译:思维链(CoT)推理通过使大语言模型能够规划、探索和深思其行动,为其性能带来显著提升。CoT也是监控这些智能体行为的强大工具:当具备忠实性时,它们能提供模型决策过程的解释,并对危险行为发出早期预警。然而,施加于CoT的优化压力可能导致模型混淆推理轨迹,从而丧失这一有益特性。我们证明混淆行为可跨任务泛化:在涉及奖励攻击(例如获取并利用泄露信息)的推理中学会混淆的模型,会将奖励攻击行为及其在CoT中的混淆同时泛化至未见过的奖励攻击场景。最令人担忧的是,我们发现即使仅在模型完成CoT后对其最终行动进行惩罚,CoT推理的混淆及其跨任务泛化现象依然会出现。我们的研究结果表明,当前惩罚有害生成内容的实践可能以不可预测的方式,无意中导致大语言模型整体可监控性的降低。