The AI Control research agenda aims to develop control protocols: safety techniques that prevent untrusted AI systems from taking harmful actions during deployment. Because human oversight is expensive, one approach is trusted monitoring, where weaker, trusted models oversee stronger, untrusted models$\unicode{x2013}$but this often fails when the untrusted model's actions exceed the monitor's comprehension. We introduce legibility protocols, which encourage the untrusted model to take actions that are easier for a monitor to evaluate. We perform control evaluations in the APPS coding setting, where an adversarial agent attempts to write backdoored code without detection. We study legibility protocols that allow the untrusted model to thoroughly document its code with comments$\unicode{x2013}$in contrast to prior work, which removed comments to prevent deceptive ones. We find that: (i) commenting protocols improve safety without sacrificing task performance relative to comment-removal baselines; (ii) commenting disproportionately benefits honest code, which typically has a natural explanation that resolves monitor suspicion, whereas backdoored code frequently lacks an easy justification; (iii) gains from commenting increase with monitor strength, as stronger monitors better distinguish genuine justifications from only superficially plausible ones.
翻译:AI控制研究议程旨在开发控制协议:即在部署过程中防止不可信AI系统采取有害行为的安全技术。由于人工监督成本高昂,一种可行方案是可信监控,即由较弱的可信模型监督较强的不可信模型——但当不可信模型的行为超出监控者的理解范围时,该方法往往失效。本文提出可读性协议,该协议促使不可信模型采取更易于监控者评估的行为。我们在APPS编程环境中进行控制评估,其中对抗性智能体试图编写后门代码而不被检测。我们研究的可读性协议允许不可信模型通过注释对其代码进行详尽文档化——这与先前为防范欺骗性注释而删除注释的研究形成对比。研究发现:(i)相较于注释删除基线,注释协议能在不牺牲任务性能的前提下提升安全性;(ii)注释对诚实代码的益处尤为显著,这类代码通常具有可消除监控疑虑的自然解释,而后门代码往往缺乏易于辩护的合理依据;(iii)注释带来的效益随监控强度提升而增加,因为更强的监控器能更好地区分真实解释与仅具表面合理性的辩解。