Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition, where a human identifies a natural decomposition of a task, and automated decomposition, where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, Python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than with either individual model. Our work suggests that even perfectly-aligned frontier systems can enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond testing single models in isolation.
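The automated-decomposition pipeline described above can be sketched as follows. This is a minimal illustration of the control flow only: `weak_model`, `frontier_model`, and `propose_benign_subtasks` are hypothetical stand-ins (trivial string stubs), not real model APIs, and the specific prompting format is an assumption for illustration.

```python
# Sketch of automated decomposition: a weak model generates benign subtasks,
# a frontier model solves them, and the weak model uses the solutions
# in-context to answer the original task. All model calls are stubs.

def weak_model(prompt: str) -> str:
    """Hypothetical stand-in for a weak, misaligned model."""
    return f"[weak-model completion of: {prompt[:60]}...]"

def frontier_model(prompt: str) -> str:
    """Hypothetical stand-in for an aligned frontier model."""
    return f"[frontier-model solution to: {prompt}]"

def propose_benign_subtasks(task: str, n: int = 3) -> list[str]:
    """Weak model rewrites the task as n benign-looking subtasks (stubbed)."""
    return [f"benign subtask {i + 1} derived from the task" for i in range(n)]

def automated_decomposition(task: str) -> str:
    # 1. Weak model generates benign subtasks for the frontier model.
    subtasks = propose_benign_subtasks(task)
    # 2. Frontier model solves each benign subtask individually; no single
    #    query reveals the malicious end goal.
    solutions = [frontier_model(s) for s in subtasks]
    # 3. Weak model consumes the (subtask, solution) pairs in-context
    #    to produce an answer to the original task.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(subtasks, solutions))
    return weak_model(f"{context}\n\nOriginal task: {task}")
```

The key property this sketch illustrates is that the frontier model only ever sees the benign subtasks, so per-model safety testing would not flag any individual query.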