The Distributed Detectability Band Against Marginal-Preserving Attacks

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm threshold. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to the benign marginal, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We organize the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves a KS distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and that realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and we scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance), while Monitor B spans AUC 0.79-0.97 at the same 1% FPR target; as harm is amortized over more steps, Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction, whereas temporal-correlation monitors can detect the attack but are not trivially optimal.
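The Gaussian-copula AR(1) idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes an Exponential(1) benign score marginal, an illustrative latent AR(1) coefficient `rho = 0.6` for the attack, and lag-1 autocorrelation as a single stand-in for the Monitor B family. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

_STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def sample_scores(n_steps, rho, rng):
    """Draw per-step scores whose marginal is Exponential(1) (assumed benign
    marginal) and whose dependence comes from a latent Gaussian AR(1) chain.
    rho = 0 gives independent (benign-like) steps; rho != 0 encodes temporal
    structure without changing the per-step marginal."""
    z = rng.gauss(0.0, 1.0)                   # start in the stationary N(0, 1) law
    innovation_sd = math.sqrt(1.0 - rho * rho)  # keeps Var(z_t) = 1 at every step
    scores = []
    for _ in range(n_steps):
        u = min(_STD_NORMAL.cdf(z), 1.0 - 1e-12)  # uniform via the Gaussian CDF
        scores.append(-math.log1p(-u))            # inverse CDF of Exponential(1)
        z = rho * z + innovation_sd * rng.gauss(0.0, 1.0)
    return scores


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (assumes continuous data, no ties)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation: a simple temporal (Monitor-B-style) statistic."""
    m = fmean(x)
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(len(x) - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den


rng = random.Random(0)
benign = sample_scores(20000, 0.0, rng)   # independent steps
attack = sample_scores(20000, 0.6, rng)   # same marginal, correlated steps

print("KS distance (marginals):", ks_distance(benign, attack))
print("lag-1 autocorr benign:  ", lag1_autocorr(benign))
print("lag-1 autocorr attack:  ", lag1_autocorr(attack))
```

In this toy setting, a statistic that looks only at the distribution of per-step scores sees nearly identical samples, while the lag-1 autocorrelation separates the two sequences. The sketch illustrates the mechanism only and makes no claim about the reported AUC values.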
