Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, which raises the stakes of jailbreaking beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-Bench (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair it with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, in order to measure deployable harm. Across seven LLM backends from five families, prompt-only attacks in JAWS-0 achieve 61% compliance; 58% are harmful, 52% parse, and 27% run end-to-end. In JAWS-1, compliance reaches ~100% for stronger models, with a mean Attack Success Rate (ASR) of ~71%; JAWS-M raises mean ASR to ~75%, with 32% runnable attack code. Wrapping an LLM in an agent increases ASR by 1.6$\times$, because initial refusals are overturned during planning and tool use. Similar trends hold for OpenHands, SWE-Agent, and OpenAI Codex, suggesting that JAWS-Bench is agent-agnostic. Category analyses identify which attack classes are most vulnerable and deployable, motivating execution-aware defenses and refusal-preserving agent designs.
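
The four-stage judge hierarchy can be read as a funnel in which each stage only counts samples that passed every earlier stage. The following is a minimal sketch of that reading, not the paper's implementation: the `Verdict` record, the `funnel_rates` helper, and the choice to report every rate over the full set of prompts are assumptions made for illustration. The `python_parses` helper shows one possible form of the syntactic check (iii) for Python outputs only; the runtime check (iv) is left as a precomputed flag, since it would require a sandboxed executor.

```python
import ast
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical per-sample judge record; field names are illustrative."""
    complied: bool  # (i) the agent did not refuse
    harmful: bool   # (ii) the output was judged a successful attack
    parses: bool    # (iii) the produced code is syntactically valid
    runs: bool      # (iv) the code executed end-to-end in a sandbox


# Stage order of the assumed hierarchy.
STAGES = ("complied", "harmful", "parses", "runs")


def python_parses(source: str) -> bool:
    """One possible syntactic check for Python outputs, using ast.parse."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def funnel_rates(verdicts: list[Verdict]) -> dict[str, float]:
    """Fraction of all samples that survive each stage and every stage before it."""
    total = len(verdicts)
    surviving = list(verdicts)
    rates: dict[str, float] = {}
    for stage in STAGES:
        # Keep only samples that pass the current stage.
        surviving = [v for v in surviving if getattr(v, stage)]
        rates[stage] = len(surviving) / total if total else 0.0
    return rates
```

Under this assumed reading the rates are non-increasing from compliance to executability, which matches the shape of the JAWS-0 figures (61%, 58%, 52%, 27%); the abstract does not state the denominators, so other readings are possible.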
