As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a "distributed threat setting" -- a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model's alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
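The two-level structure described above (a macro-protocol maintaining a credence on alignment, used to switch between micro-protocols) can be sketched in miniature. This is a minimal illustrative sketch, not the paper's actual protocol: the likelihood values, the 0.5 threshold, and the function and protocol names (`update_credence`, `choose_micro_protocol`, `use_trusted_model`, `use_untrusted_output`) are all assumptions introduced here for illustration.

```python
# Illustrative sketch of an adaptive macro-protocol over micro-protocols.
# All parameter values and names are hypothetical, not from the paper.

def update_credence(credence, suspicious,
                    p_suspicious_if_aligned=0.1,
                    p_suspicious_if_misaligned=0.5):
    """Bayesian update on the credence that the untrusted model is aligned,
    given whether the trusted monitor flagged the latest action as suspicious.
    The two likelihoods are assumed monitor error/detection rates."""
    if suspicious:
        l_aligned = p_suspicious_if_aligned
        l_misaligned = p_suspicious_if_misaligned
    else:
        l_aligned = 1 - p_suspicious_if_aligned
        l_misaligned = 1 - p_suspicious_if_misaligned
    numerator = credence * l_aligned
    return numerator / (numerator + (1 - credence) * l_misaligned)

def choose_micro_protocol(credence, threshold=0.5):
    """Macro-protocol decision rule: pick the riskier, more useful
    micro-protocol when credence in alignment is high, and fall back to
    a safer, trusted-model-heavy micro-protocol when it is low."""
    return "use_untrusted_output" if credence >= threshold else "use_trusted_model"
```

For example, starting from a credence of 0.5, one suspicious action drives the credence down (to 1/6 under these assumed likelihoods) and the macro-protocol falls back to the safer micro-protocol; a benign-looking action raises it (to 9/14) and the riskier micro-protocol is used.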