As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
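The trusted-editing protocol described above can be sketched in a few lines. Everything here is a hypothetical stub, not a real API: `untrusted_generate` stands in for the untrusted model (GPT-4 in the paper), `trusted_suspicion` and `trusted_edit` stand in for the trusted model (GPT-3.5), and the threshold value is an illustrative assumption rather than a number from the paper.

```python
# Sketch of the "trusted editing" protocol, assuming stub model calls.
# None of these functions are real GPT-4/GPT-3.5 APIs; they only mimic
# the control flow: generate -> rate suspicion -> submit or edit.

SUSPICION_THRESHOLD = 0.5  # assumed tunable threshold, for illustration only


def untrusted_generate(problem: str) -> str:
    """Stand-in for the untrusted model writing a candidate solution."""
    return f"def solve():\n    # solution for: {problem}\n    return 42"


def trusted_suspicion(code: str) -> float:
    """Stand-in for the trusted model rating suspiciousness in [0, 1]."""
    return 0.9 if "backdoor" in code else 0.1


def trusted_edit(code: str) -> str:
    """Stand-in for the trusted model rewriting the suspicious parts."""
    return code.replace("backdoor", "removed")


def trusted_editing_protocol(problem: str) -> str:
    """Submit the untrusted code directly if it looks benign;
    otherwise submit a trusted edit of it."""
    code = untrusted_generate(problem)
    if trusted_suspicion(code) < SUSPICION_THRESHOLD:
        return code  # low suspicion: submit as-is
    return trusted_edit(code)  # high suspicion: submit the edited version
```

The key design point the sketch preserves is that the trusted (weaker) model is the final gatekeeper: the untrusted model's output never reaches submission without passing the suspiciousness check or being rewritten.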