We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times$6 Target $\times$ Technique matrix grounded in STRIDE. The matrix is constructed from a 507-leaf taxonomy of inference-time attacks (401 data-populated leaves and 106 threat-model-derived leaves) extracted from 932 arXiv security studies (2023--2026). It enables benchmark-external validation, that is, auditing collective coverage rather than the consistency of individual benchmarks. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25\% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, even though published attacks in these categories achieve 46$\times$ token amplification and 96\% attack success rates through mechanisms that no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety \& Alignment Bypass, two structural properties that are invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
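The coverage audit described above can be illustrated with a minimal sketch. It assumes each benchmark has already been mapped to a set of (target, technique) cells. The axis labels, the `coverage_report` helper, and the benchmark names `bench_a`, `bench_b`, `bench_c` are hypothetical placeholders, not the released artifacts or the actual cell assignments.

```python
from collections import Counter
from itertools import product

# Hypothetical axis labels: the text fixes only the 4x6 shape, not the names.
TARGETS = [f"target_{i}" for i in range(4)]
TECHNIQUES = [f"technique_{j}" for j in range(6)]
ALL_CELLS = set(product(TARGETS, TECHNIQUES))  # 4 * 6 = 24 cells


def coverage_report(benchmark_cells):
    """Summarize how a set of benchmarks covers the Target x Technique matrix.

    benchmark_cells maps a benchmark name to an iterable of
    (target, technique) cells that the benchmark evaluates.
    """
    cell_sets = {name: set(cells) for name, cells in benchmark_cells.items()}
    for name, cells in cell_sets.items():
        unknown = cells - ALL_CELLS
        if unknown:
            raise ValueError(f"{name} maps to cells outside the matrix: {sorted(unknown)}")

    # Count how many benchmarks claim each cell.
    claims = Counter(cell for cells in cell_sets.values() for cell in cells)
    covered = set(claims)
    return {
        "per_benchmark": {n: len(c) / len(ALL_CELLS) for n, c in cell_sets.items()},
        "collective": len(covered) / len(ALL_CELLS),
        "overlapping": {cell for cell, k in claims.items() if k > 1},
        "uncovered": ALL_CELLS - covered,
    }


# Placeholder mapping for illustration only; these are not the paper's assignments.
example = {
    "bench_a": [("target_0", "technique_0"), ("target_0", "technique_1")],
    "bench_b": [("target_1", "technique_2"), ("target_1", "technique_3")],
    "bench_c": [("target_2", "technique_4"), ("target_2", "technique_5")],
}
report = coverage_report(example)
print(f"collective coverage: {report['collective']:.0%}")
print(f"uncovered cells: {len(report['uncovered'])} of {len(ALL_CELLS)}")
```

With this placeholder mapping, three disjoint benchmarks of two cells each cover 6 of 24 cells, so the sketch reports 25% collective coverage and 18 uncovered cells. Mapping a new benchmark onto the matrix amounts to adding one more entry to the dictionary and rerunning the report.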