AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts models with "circuit breakers" as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, circuit breakers allow the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards against harmful behavior and adversarial attacks.
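Conceptually, circuit breaking "reroutes" the internal representations that mediate harmful generations while keeping benign representations close to those of the original model. The sketch below shows one way such a rerouting-plus-retain objective could look; the function name, arguments, and weighting coefficients are illustrative assumptions, not the exact training recipe described in the paper.

```python
# Minimal sketch of a representation-rerouting objective in the spirit of
# circuit breakers. All names and hyperparameters here are illustrative.
import torch
import torch.nn.functional as F

def circuit_breaker_loss(harmful_hidden, harmful_hidden_frozen,
                         benign_hidden, benign_hidden_frozen,
                         alpha=1.0, beta=1.0):
    """
    harmful_hidden:         hidden states of the model being trained, on harmful data
    harmful_hidden_frozen:  hidden states of the frozen original model, on the same data
    benign_hidden:          hidden states of the model being trained, on benign data
    benign_hidden_frozen:   hidden states of the frozen original model, on the same data
    """
    # Rerouting term: push representations that previously led to harmful
    # outputs away from their original directions (drive cosine similarity
    # toward zero or below).
    reroute = F.relu(
        F.cosine_similarity(harmful_hidden, harmful_hidden_frozen, dim=-1)
    ).mean()

    # Retain term: keep benign representations close to the original model's,
    # so general utility is preserved.
    retain = torch.norm(benign_hidden - benign_hidden_frozen, dim=-1).mean()

    return alpha * reroute + beta * retain
```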