Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition, where a human identifies a natural decomposition of a task, and automated decomposition, where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, Python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than with either individual model. Our work suggests that even perfectly-aligned frontier systems can enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond testing single models in isolation.
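The automated-decomposition pipeline described above can be sketched as follows. This is a minimal illustration of the control flow only: `weak_model`, `frontier_model`, and `propose_benign_subtasks` are hypothetical stand-ins (trivial string stubs), not real model APIs, and the specific prompting format is an assumption for illustration.

```python
# Sketch of automated decomposition: a weak model generates benign subtasks,
# a frontier model solves them, and the weak model uses the solutions
# in-context to answer the original task. All model calls are stubs.

def weak_model(prompt: str) -> str:
    """Hypothetical stand-in for a weak, misaligned model."""
    return f"[weak-model completion of: {prompt[:60]}...]"

def frontier_model(prompt: str) -> str:
    """Hypothetical stand-in for an aligned frontier model."""
    return f"[frontier-model solution to: {prompt}]"

def propose_benign_subtasks(task: str, n: int = 3) -> list[str]:
    """Weak model rewrites the task as n benign-looking subtasks (stubbed)."""
    return [f"benign subtask {i + 1} derived from the task" for i in range(n)]

def automated_decomposition(task: str) -> str:
    # 1. Weak model generates benign subtasks for the frontier model.
    subtasks = propose_benign_subtasks(task)
    # 2. Frontier model solves each benign subtask individually; no single
    #    query reveals the malicious end goal.
    solutions = [frontier_model(s) for s in subtasks]
    # 3. Weak model consumes the (subtask, solution) pairs in-context
    #    to produce an answer to the original task.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(subtasks, solutions))
    return weak_model(f"{context}\n\nOriginal task: {task}")
```

The key property this sketch illustrates is that the frontier model only ever sees the benign subtasks, so per-model safety testing would not flag any individual query.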