Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance, or ``sandbagging.'' We present a novel, model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that injecting Gaussian noise into the weights of models prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise-injected sandbagging models show performance improvements not seen in standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be implemented immediately by frontier labs or regulatory bodies with access to model weights, improving the trustworthiness of capability evaluations.
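The core of the method described above can be sketched in a few lines: sweep a range of Gaussian noise scales over a model's weights, evaluate benchmark accuracy at each scale, and flag sandbagging if any perturbed variant beats the unperturbed baseline by a margin (honest models should only degrade under noise). The sketch below is a minimal, hypothetical illustration with NumPy; the function names, the flat-weight-vector abstraction, and the detection margin are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def noise_injection_scan(weights, evaluate, sigmas, seed=0):
    """Evaluate the model at each Gaussian noise scale in `sigmas`.

    weights:  flat parameter vector of the model under test
    evaluate: callable mapping a weight vector to benchmark accuracy
    sigmas:   standard deviations of the injected Gaussian noise
    """
    rng = np.random.default_rng(seed)
    return [evaluate(weights + rng.normal(0.0, s, size=weights.shape))
            for s in sigmas]

def flags_sandbagging(noisy_accuracies, baseline_accuracy, margin=0.05):
    """Hypothetical detector: flag the model if any noise level improves
    on the unperturbed baseline by more than `margin`. Noise should only
    degrade an honestly performing model, so a sizeable improvement
    suggests deliberately suppressed capability."""
    return max(noisy_accuracies) > baseline_accuracy + margin
```

As a toy sanity check, one can stand in for a real model with a linear classifier: an "honest" weight vector only loses accuracy under noise and is not flagged, while a "sandbagging" vector (e.g. the negated weights, which deliberately answers incorrectly) recovers accuracy as noise disrupts the suppression, and is flagged.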