Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings and pays limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight of AI coding sabotage. Over 100 participants each collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect the sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still ignore its warnings and accept the malicious code. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

