AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each of which stays below any per-step alarm. We build a marginal-preserving, correlation-encoded distributed-sabotage attack from a Gaussian-copula AR(1) process: the per-step monitor-score marginal is held exactly equal to the benign marginal, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We organize the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack has a KS distance to benign of 0.013 (effectively zero) at every tested harm level up to 3.0, confirming that harm is fully decoupled from the per-step marginal and that realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under Monitor B, which uses a different score function (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and we scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance), whereas Monitor B spans AUC 0.79 to 0.97 at the same 1% FPR target; as harm is amortized over more steps, Monitor A collapses to chance while Monitor B holds at AUC ~0.95. Together, these results establish a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction, whereas temporal-correlation monitors can detect the attack but are not trivially optimal.
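The Gaussian-copula AR(1) idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the Exponential benign marginal, the `rate` parameter, the use of the latent correlation `rho` as a stand-in for the harm knob, and all function names are assumptions made for illustration. A latent stationary AR(1) Gaussian sequence is pushed through the normal CDF and then through the inverse CDF of the assumed benign marginal, so every per-step score has the same distribution regardless of `rho`, while the temporal dependence changes.

```python
import math
import numpy as np

# Standard normal CDF, vectorized over arrays (NumPy has no built-in erf).
_phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def sample_scores(n_traj, n_steps, rho, rng, rate=5.0):
    """Gaussian-copula AR(1) scores with a fixed per-step marginal.

    ASSUMPTION: the benign marginal is Exponential(rate); the paper's actual
    benign score distribution is not given in the abstract.
    """
    z = np.empty((n_traj, n_steps))
    z[:, 0] = rng.standard_normal(n_traj)      # stationary start: z_t ~ N(0, 1) for all t
    innov = math.sqrt(1.0 - rho ** 2)          # keeps the latent variance at 1
    for t in range(1, n_steps):
        z[:, t] = rho * z[:, t - 1] + innov * rng.standard_normal(n_traj)
    u = np.clip(_phi(z), 1e-12, 1.0 - 1e-12)   # uniform marginals via the normal CDF
    return -np.log1p(-u) / rate                # inverse CDF of Exponential(rate)

def ks_distance(a, b):
    """Two-sample KS statistic on pooled per-step scores."""
    a, b = np.sort(a.ravel()), np.sort(b.ravel())
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

def lag1_autocorr(x):
    """Per-trajectory lag-1 sample autocorrelation."""
    c = x - x.mean(axis=1, keepdims=True)
    return (c[:, 1:] * c[:, :-1]).sum(axis=1) / (c ** 2).sum(axis=1)

rng = np.random.default_rng(0)
benign = sample_scores(2000, 50, rho=0.0, rng=rng)   # i.i.d. steps
attack = sample_scores(2000, 50, rho=0.6, rng=rng)   # hypothetical correlation level

ks = ks_distance(benign, attack)            # marginal view (Monitor A family)
benign_ac = lag1_autocorr(benign).mean()    # temporal view (Monitor B family)
attack_ac = lag1_autocorr(attack).mean()
print(f"KS distance of pooled marginals: {ks:.3f}")
print(f"mean lag-1 autocorrelation: benign={benign_ac:.3f}, attack={attack_ac:.3f}")
```

In this toy setting the pooled per-step score distributions are nearly indistinguishable, while the lag-1 autocorrelation separates the two conditions. That mirrors the qualitative split between distribution-shape and temporal-correlation monitors, but it does not reproduce the reported KS or AUC figures.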