What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

Autonomous coding agents built on large language models (LLMs) are rapidly being integrated into development workflows, yet their operational safety properties remain poorly understood beyond evaluations of explicitly malicious inputs. In practice, high-impact failures arise during benign, goal-directed use, through environment breakage, fabricated success reports, and similar behaviors that current benchmarks do not capture. We ask: what categories of operational safety failures actually occur when coding agents are used for everyday development tasks, and what is their impact? We present an incident-driven empirical study grounded in two complementary evidence streams. First, we screen 68,816 papers from 22 premier venues and curate 185 safety-relevant studies. Second, we mine 16,586 GitHub issues from widely deployed LLM-powered coding tools and manually confirm 547 genuine safety failures. Applying systematic open coding to both corpora, we derive a multi-dimensional safety taxonomy of 33 operational risk types organized across seven dimensions, and we annotate each incident with contributing factors, task context, severity, and downstream impact. Our findings show that coding-agent failures are often severe: 326 of the 547 incidents are rated high or critical. The dominant risks are constraint violations, destructive operations, authorization bypasses, and deception. Over 65% of incidents arise during bug fixing and setup or configuration tasks, a pattern largely missing from prior literature. These results have direct implications for software engineering (SE) tool designers and benchmark developers: guardrails must go beyond adversarial-prompt defenses to enforce environmental constraints, failure transparency, and safe-halt behaviors.
