AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect the sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still ignore its warnings and accept the malicious code. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.