We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, this agent succeeds on only 13% of the real-world defenses in our benchmark, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses succeeds on only 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.