Stateful Online Monitoring Catches Distributed Agent Attacks

Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so that each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is visible only in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack: a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts. A standard monitor catches this attack only a fifth as often as it catches prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto-dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. The detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and, surprisingly, find that it also catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors that reason over groups of users rather than isolated transcripts.
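To make the idea of a stateful monitor concrete, here is a minimal sketch of how weak per-transcript signals might be pooled across accounts. It is an illustration under stated assumptions, not the monitor described in this paper: the greedy cosine-similarity clustering, the class and parameter names (`StatefulMonitor`, `sim_threshold`, `escalate_score`, `min_accounts`), and the escalation rule are all hypothetical. It also assumes each transcript arrives with an embedding vector and a weak suspiciousness score from some cheap per-transcript scorer.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Running state for one group of similar transcripts (hypothetical)."""
    centroid: list
    accounts: set = field(default_factory=set)
    total_score: float = 0.0
    count: int = 0


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class StatefulMonitor:
    """Pools weak suspiciousness scores over clusters of similar transcripts.

    All thresholds are illustrative assumptions, not values from the paper.
    """

    def __init__(self, sim_threshold=0.9, escalate_score=1.0, min_accounts=3):
        self.sim_threshold = sim_threshold
        self.escalate_score = escalate_score
        self.min_accounts = min_accounts
        self.clusters = []

    def observe(self, account_id, embedding, weak_score):
        """Ingest one transcript; return True if its cluster should be
        escalated to a more expensive cross-account review."""
        # Find the most similar existing cluster.
        best, best_sim = None, -1.0
        for cluster in self.clusters:
            sim = cosine(cluster.centroid, embedding)
            if sim > best_sim:
                best, best_sim = cluster, sim

        if best is None or best_sim < self.sim_threshold:
            # Nothing close enough: start a new cluster.
            best = Cluster(centroid=list(embedding))
            self.clusters.append(best)
        else:
            # Update the centroid as a running mean of member embeddings.
            n = best.count
            best.centroid = [
                (c * n + e) / (n + 1) for c, e in zip(best.centroid, embedding)
            ]

        best.count += 1
        best.accounts.add(account_id)
        best.total_score += weak_score

        # Escalate only when the pooled signal is strong AND spans
        # several distinct accounts.
        return (
            len(best.accounts) >= self.min_accounts
            and best.total_score >= self.escalate_score
        )
```

In this sketch, no single transcript with a weak score of 0.4 would trigger escalation on its own; only once similar transcripts from three distinct accounts have pooled a total score of at least 1.0 does `observe` return `True`. Most traffic is handled by the cheap clustering step alone, which is the intuition behind keeping added latency low for the bulk of users.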
