Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards its latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to deploy similar defenses soon. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms the state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
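To make the pipeline structure and the ASR metric concrete, the following is a minimal, self-contained sketch in Python of a layered defense in which an input classifier screens prompts and an output classifier screens completions, with ASR computed as the fraction of attacked prompts whose final response is judged harmful. All names here (`DefensePipeline`, `attack_success_rate`, the toy classifiers and judge) are hypothetical stand-ins for illustration, not the implementation or classifiers evaluated in this work.

```python
# Illustrative sketch (assumptions, not the paper's implementation) of a
# layered defense pipeline: an input classifier screens the prompt, the
# underlying model generates, and an output classifier screens the response.
from dataclasses import dataclass
from typing import Callable, List

Classifier = Callable[[str], bool]   # returns True if the text is flagged
Generator = Callable[[str], str]     # underlying model (stubbed below)


@dataclass
class DefensePipeline:
    input_classifier: Classifier
    generate: Generator
    output_classifier: Classifier

    def respond(self, prompt: str) -> str:
        # Refuse if either safeguard layer flags the interaction.
        if self.input_classifier(prompt):
            return "[refused by input classifier]"
        completion = self.generate(prompt)
        if self.output_classifier(completion):
            return "[refused by output classifier]"
        return completion


def attack_success_rate(pipeline: DefensePipeline,
                        attacked_prompts: List[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """ASR = fraction of attacked prompts whose final response is judged harmful."""
    successes = sum(is_harmful(pipeline.respond(p)) for p in attacked_prompts)
    return successes / len(attacked_prompts)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model or API access.
    flag_input = lambda text: "forbidden" in text.lower()
    flag_output = lambda text: "harmful details" in text.lower()
    toy_model = lambda prompt: f"harmful details for: {prompt}"
    judge = lambda response: "harmful details" in response.lower()

    pipeline = DefensePipeline(flag_input, toy_model, flag_output)
    prompts = ["forbidden request", "obfuscated request"]
    # Both toy prompts are caught (one by each layer), so ASR is 0% here;
    # a staged attack would need to bypass both layers to raise this number.
    print(f"ASR: {attack_success_rate(pipeline, prompts, judge):.0%}")
```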