Frontier AI developers rely on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards its latest Claude 4 Opus model with one such defense pipeline, and other frontier developers, including Google DeepMind and OpenAI, have pledged to deploy similar defenses soon. However, the security of such pipelines is unclear, and limited prior work has evaluated or attacked them. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms the state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic-misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we evaluate STACK in a transfer setting, achieving 33% ASR and providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
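To make the setting concrete, the following is a minimal sketch of the kind of safeguard pipeline and evaluation metric discussed above: an input classifier screens prompts before generation, an output classifier screens the model's response, and the attack success rate (ASR) is the fraction of harmful requests that yield an unblocked, harmful response. All names and interfaces here are hypothetical illustrations, not the implementation used in this work.

```python
# Hypothetical sketch of an input/output-classifier safeguard pipeline and ASR.
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I can't help with that."


@dataclass
class SafeguardPipeline:
    input_classifier: Callable[[str], bool]         # True -> prompt flagged as harmful
    model: Callable[[str], str]                     # the guarded generative model
    output_classifier: Callable[[str, str], bool]   # True -> response flagged as harmful

    def respond(self, prompt: str) -> str:
        # Stage 1: block the request before it reaches the model.
        if self.input_classifier(prompt):
            return REFUSAL
        response = self.model(prompt)
        # Stage 2: block the response before it reaches the user.
        if self.output_classifier(prompt, response):
            return REFUSAL
        return response


def attack_success_rate(
    pipeline: SafeguardPipeline,
    harmful_prompts: list[str],
    is_harmful: Callable[[str, str], bool],  # judge of whether a response is harmful
) -> float:
    """Fraction of harmful prompts whose final, unblocked response is judged harmful."""
    successes = sum(
        1
        for p in harmful_prompts
        if (r := pipeline.respond(p)) != REFUSAL and is_harmful(p, r)
    )
    return successes / len(harmful_prompts)
```

Under this framing, an attack succeeds only if it slips past both classifiers and still elicits genuinely harmful model output, which is why attacking the classifiers and the underlying model can be treated as separable stages.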