AI Control: Improving Safety Despite Intentional Subversion

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.

翻译：随着大语言模型（LLMs）变得更加强大并被更自主地部署，防止其造成有害结果将愈发重要。研究人员已为此探索了多种安全技术，例如使用模型审查其他模型的输出，或采用红队技术暴露微妙的故障模式。然而，当模型本身试图蓄意颠覆这些技术时，其是否仍能确保安全尚未得到评估。本文开发并评估了对意图性颠覆具有鲁棒性的安全技术管线（"协议"）。我们研究了一个场景：需要解决一系列编程问题，同时拥有一个强大但不可信的模型（本文使用GPT-4）、一个较弱但可信的模型（本文使用GPT-3.5），以及有限的高质量可信劳动力。我们研究的协议旨在避免提交包含后门的解决方案——此处将后门操作化定义为未被测试用例捕获的逻辑错误。我们探索了一系列协议，并针对每个协议测试了不可信模型可能采用的颠覆策略。其中一种协议称为"可信编辑"：该协议首先要求GPT-4编写代码，然后要求GPT-3.5评估该代码的可疑程度。若代码低于某个可疑阈值则直接提交；否则，GPT-3.5编辑解决方案以移除可疑部分后提交编辑后的代码。另一种协议是"不可信监控"：该协议要求GPT-4编写代码，随后由另一个GPT-4实例通过多种防止串通的技术判断代码是否包含后门。这些协议相较于简单基线方案有显著改进。