As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a "distributed threat setting" -- a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model's alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
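The two-level structure described above (a macro-protocol maintaining a credence on alignment, used to switch between micro-protocols) can be sketched in miniature. This is a minimal illustrative sketch, not the paper's actual protocol: the likelihood values, the 0.5 threshold, and the function and protocol names (`update_credence`, `choose_micro_protocol`, `use_trusted_model`, `use_untrusted_output`) are all assumptions introduced here for illustration.

```python
# Illustrative sketch of an adaptive macro-protocol over micro-protocols.
# All parameter values and names are hypothetical, not from the paper.

def update_credence(credence, suspicious,
                    p_suspicious_if_aligned=0.1,
                    p_suspicious_if_misaligned=0.5):
    """Bayesian update on the credence that the untrusted model is aligned,
    given whether the trusted monitor flagged the latest action as suspicious.
    The two likelihoods are assumed monitor error/detection rates."""
    if suspicious:
        l_aligned = p_suspicious_if_aligned
        l_misaligned = p_suspicious_if_misaligned
    else:
        l_aligned = 1 - p_suspicious_if_aligned
        l_misaligned = 1 - p_suspicious_if_misaligned
    numerator = credence * l_aligned
    return numerator / (numerator + (1 - credence) * l_misaligned)

def choose_micro_protocol(credence, threshold=0.5):
    """Macro-protocol decision rule: pick the riskier, more useful
    micro-protocol when credence in alignment is high, and fall back to
    a safer, trusted-model-heavy micro-protocol when it is low."""
    return "use_untrusted_output" if credence >= threshold else "use_trusted_model"
```

For example, starting from a credence of 0.5, one suspicious action drives the credence down (to 1/6 under these assumed likelihoods) and the macro-protocol falls back to the safer micro-protocol; a benign-looking action raises it (to 9/14) and the riskier micro-protocol is used.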