When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Automated LLM vulnerability scanners are increasingly used to assess security risk by measuring the attack success rate (ASR) for different attack types. Yet the validity of these measurements hinges on an often-overlooked component: the evaluator that determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component: changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To address this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method in which evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, the framework shows that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator-induced uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluator. Our framework offers a practical way to detect and quantify unreliable evaluations and to improve the reliability of measurements in automated LLM security assessments.
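To make the measured quantities concrete, the following minimal Python sketch computes a per-evaluator ASR, the spread of ASR across evaluators, and a mean pairwise disagreement rate over a fixed set of model outputs. It is an illustration only and not the framework's implementation: the function names, the evaluator names, the binary verdict encoding, and the verdict values are all hypothetical, and the abstract does not specify which disagreement statistic is used.

```python
from itertools import combinations


def attack_success_rate(verdicts):
    """Fraction of outputs that an evaluator judged to be successful attacks."""
    return sum(verdicts) / len(verdicts)


def asr_spread(verdicts_by_evaluator):
    """Gap between the highest and lowest ASR reported by any evaluator."""
    asrs = [attack_success_rate(v) for v in verdicts_by_evaluator.values()]
    return max(asrs) - min(asrs)


def pairwise_disagreement(verdicts_by_evaluator):
    """Mean fraction of outputs on which a pair of evaluators disagree."""
    rates = []
    for a, b in combinations(verdicts_by_evaluator.values(), 2):
        differing = sum(x != y for x, y in zip(a, b))
        rates.append(differing / len(a))
    return sum(rates) / len(rates)


# Hypothetical verdicts for one attack category, all on the SAME eight
# model outputs: 1 = attack judged successful, 0 = attack judged failed.
verdicts = {
    "evaluator_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "evaluator_b": [1, 0, 0, 1, 0, 0, 0, 0],
    "evaluator_c": [1, 1, 1, 1, 0, 1, 1, 0],
}

for name, v in verdicts.items():
    print(f"{name}: ASR = {attack_success_rate(v):.2f}")
print(f"ASR spread across evaluators: {asr_spread(verdicts):.2f}")
print(f"Mean pairwise disagreement:   {pairwise_disagreement(verdicts):.2f}")
```

Because the attacks and outputs are held fixed in this toy data, any difference in the printed ASR values comes from the evaluators alone, which is the effect the first phase is meant to surface. A category with a high disagreement rate would be a candidate for the verifier-based second phase.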
