The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

As agentic coding systems decompose work across multiple model instances, a critical safety question is whether those instances can coordinate to achieve a hidden malicious objective while appearing aligned with user intent. We introduce SCHEME, a benchmark of 17 task instances across 7 settings and 8 real open-source libraries, each pairing a legitimate software-engineering task with a covert side task. Every setting is designed so that no proper subset of agents can succeed alone: agents must decompose a shared sabotage plan, relay partial requirements under different communication topologies, and execute mutually consistent edits. The benchmark therefore tests genuine multi-agent coordination rather than individual capability. Evaluating GPT 5.1 Codex and Gemini 3.1 Pro, we find that coordinated sabotage is already practical: Gemini completes the covert objective while also succeeding on the legitimate task in 84\% of samples, and Codex in 46\%. A failure analysis shows that recovery, not failure incidence, drives the model gap: both models suffer similar numbers of communication failures per trajectory, but Gemini recovers from $\sim$73\% of them against Codex's $\sim$56\%. However, the agents' code edits remain detectable: at a false-positive rate (FPR) $\le$2\%, a trusted monitor flags 99\%/68\% (Gemini/Codex) of coordinated sabotage attempts from code edits alone, rising to 100\%/81\% with access to agent communication. Despite stealth-aware prompting, undetected sabotage stays at $<$3\% under the full-visibility monitor; telling agents that a monitor is present halves Gemini's sabotage success without significantly degrading detection, even when agents are explicitly prompted to fabricate cover stories.
