Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them (attack selection) is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: mapping attacks to a quality score, and mapping quality scores to submission probabilities. We frame attack selection as a classification problem and show that safety is significantly more sensitive to the false positive rate (FPR) than to the true positive rate (TPR). We find that prompting the attacker model to reason about the monitor while being cautious with attack selection reduces safety from a baseline of 99% to 59% at a 0.5% auditing budget, emphasizing that eliciting the attack selection capabilities of models is vital to avoid overly optimistic safety scores in control evaluations.
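To make the two-stage decomposition concrete, the following is a minimal sketch, not the method used in this work. It assumes quality scores lie in [0, 1], uses a hypothetical threshold rule for the score-to-probability mapping, and assumes the "positive" class means an attack that would evade the monitor. The names `threshold_policy`, `select_attacks`, and `tpr_fpr` are illustrative only.

```python
import random
from typing import Callable, Sequence, Tuple


def threshold_policy(threshold: float, p_submit: float = 1.0) -> Callable[[float], float]:
    """Hypothetical stage 2: map a quality score to a submission probability.

    Scores at or above the threshold are submitted with probability p_submit;
    everything else is never submitted. This rule is an assumption.
    """
    def policy(score: float) -> float:
        return p_submit if score >= threshold else 0.0
    return policy


def select_attacks(
    scores: Sequence[float],
    policy: Callable[[float], float],
    rng: random.Random,
) -> list:
    """Sample a submit / hold decision for each attack from its quality score."""
    return [rng.random() < policy(score) for score in scores]


def tpr_fpr(submitted: Sequence[bool], evades: Sequence[bool]) -> Tuple[float, float]:
    """Treat selection as classification (assumed labelling).

    Positive = the attack would evade the monitor.
    TPR = fraction of evading attacks that were submitted.
    FPR = fraction of attacks that would be caught but were submitted anyway.
    """
    tp = sum(s and e for s, e in zip(submitted, evades))
    fp = sum(s and not e for s, e in zip(submitted, evades))
    positives = sum(evades)
    negatives = len(evades) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr
```

Under this assumed labelling, a false positive is a submitted attack that the monitor flags, which is one intuition for why a cautious attacker would prioritise a low FPR. Raising `threshold` or lowering `p_submit` makes the policy more conservative.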