The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.
翻译:用于评估安全关键角色中AI智能体的基准测试存在重大缺陷。基于最新实证证据,我们归纳了削弱安全性评估的三个核心挑战:基准测试漏洞、时间上的陈旧性以及运行时不确定性。随后,我们概述了构建更稳健、更可信评估框架的实践方向。