As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
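The trusted-editing protocol described above can be sketched in a few lines. Everything here is a hypothetical stub, not a real API: `untrusted_generate` stands in for the untrusted model (GPT-4 in the paper), `trusted_suspicion` and `trusted_edit` stand in for the trusted model (GPT-3.5), and the threshold value is an illustrative assumption rather than a number from the paper.

```python
# Sketch of the "trusted editing" protocol, assuming stub model calls.
# None of these functions are real GPT-4/GPT-3.5 APIs; they only mimic
# the control flow: generate -> rate suspicion -> submit or edit.

SUSPICION_THRESHOLD = 0.5  # assumed tunable threshold, for illustration only


def untrusted_generate(problem: str) -> str:
    """Stand-in for the untrusted model writing a candidate solution."""
    return f"def solve():\n    # solution for: {problem}\n    return 42"


def trusted_suspicion(code: str) -> float:
    """Stand-in for the trusted model rating suspiciousness in [0, 1]."""
    return 0.9 if "backdoor" in code else 0.1


def trusted_edit(code: str) -> str:
    """Stand-in for the trusted model rewriting the suspicious parts."""
    return code.replace("backdoor", "removed")


def trusted_editing_protocol(problem: str) -> str:
    """Submit the untrusted code directly if it looks benign;
    otherwise submit a trusted edit of it."""
    code = untrusted_generate(problem)
    if trusted_suspicion(code) < SUSPICION_THRESHOLD:
        return code  # low suspicion: submit as-is
    return trusted_edit(code)  # high suspicion: submit the edited version
```

The key design point the sketch preserves is that the trusted (weaker) model is the final gatekeeper: the untrusted model's output never reaches submission without passing the suspiciousness check or being rewritten.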