We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times$6 Target $\times$ Technique matrix grounded in STRIDE and constructed from a 507-leaf taxonomy of inference-time attacks (401 data-populated and 106 threat-model-derived leaves) extracted from 932 arXiv security studies (2023--2026). The matrix enables benchmark-external validation: it audits collective coverage rather than the consistency of individual benchmarks. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25\% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation. Published attacks in these categories nonetheless achieve 46$\times$ token amplification and 96\% attack success rates through mechanisms that no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety \& Alignment Bypass, structural properties that are invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; new benchmarks can be mapped onto the same matrix as they emerge, enabling the community to track whether evaluation gaps are closing.
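The coverage audit described in the abstract can be illustrated with a minimal sketch. It assumes each benchmark is mapped to a set of (target, technique) cells in the 4$\times$6 matrix. The benchmark names `bench_a`, `bench_b`, `bench_c`, their cell assignments, and the function `coverage_report` are hypothetical placeholders, not the released mappings or the authors' tooling.

```python
from itertools import combinations

N_TARGETS, N_TECHNIQUES = 4, 6  # matrix shape stated in the abstract

# Hypothetical placeholder mappings: each benchmark covers a set of
# (target_row, technique_col) cells. These are NOT the paper's data.
benchmark_cells = {
    "bench_a": {(0, 0), (0, 1)},
    "bench_b": {(1, 2), (1, 3)},
    "bench_c": {(2, 2), (2, 4)},
}


def coverage_report(cells_by_benchmark, n_rows=N_TARGETS, n_cols=N_TECHNIQUES):
    """Summarize collective coverage of an n_rows x n_cols matrix."""
    # Union of all cells touched by at least one benchmark.
    covered = set().union(*cells_by_benchmark.values())
    # Benchmark pairs that share at least one cell.
    overlapping = [
        (a, b)
        for a, b in combinations(sorted(cells_by_benchmark), 2)
        if cells_by_benchmark[a] & cells_by_benchmark[b]
    ]
    # Target rows with no covered cell in any technique column.
    uncovered_targets = [
        r for r in range(n_rows)
        if not any((r, c) in covered for c in range(n_cols))
    ]
    return {
        "covered_fraction": len(covered) / (n_rows * n_cols),
        "overlapping_pairs": overlapping,
        "uncovered_targets": uncovered_targets,
    }


report = coverage_report(benchmark_cells)
```

With these placeholder assignments, six of the 24 cells are covered, no two benchmarks share a cell, and one target row has no coverage at all. Mapping a new benchmark amounts to adding one more entry to the dictionary and recomputing the report.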