We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks along three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets over-rely on "triggering cues": words or phrases with overtly negative or sensitive connotations that are intended to explicitly trigger safety mechanisms, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To this end, we introduce "intent laundering": a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks because of their overreliance on triggering cues: once these cues are removed, all models previously judged "reasonably safe" become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.