Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, which raises the stakes of "jailbreak" attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-Bench (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair it with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, in order to measure deployable harm. Across seven LLM backends from five families, prompt-only attacks in JAWS-0 achieve 61% compliance; 58% are harmful, 52% parse, and 27% run end-to-end. In JAWS-1, compliance reaches ~100% for stronger models, with a mean attack success rate (ASR) of ~71%; JAWS-M raises mean ASR to ~75%, with 32% runnable attack code. Wrapping an LLM in an agent increases ASR by 1.6$\times$, because initial refusals are overturned during planning and tool use. Similar trends hold for OpenHands, SWE-Agent, and OpenAI Codex, suggesting that JAWS-Bench is agent-agnostic. Category analyses identify which attack classes are most vulnerable and most deployable, motivating execution-aware defenses and refusal-preserving agent designs.
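To make the four-stage judge concrete, the following is a minimal sketch of how hierarchical per-stage rates could be tallied. It is not the paper's implementation. It assumes that each stage counts only when every earlier stage has passed, and that rates are reported over all prompts. The names `Verdict`, `python_parses`, and `funnel_rates` are hypothetical, and the runtime flag is taken as an input rather than produced by executing anything.

```python
import ast
from dataclasses import dataclass


@dataclass
class Verdict:
    """Per-prompt judge outcome (hypothetical structure, not from the paper)."""
    complied: bool  # (i) the agent did not refuse
    harmful: bool   # (ii) the output was judged to realize the attack
    parses: bool    # (iii) the produced code is syntactically valid
    runs: bool      # (iv) the code ran end-to-end, as reported by a sandbox


def python_parses(source: str) -> bool:
    """Static syntax check for Python output only; nothing is executed."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def funnel_rates(verdicts):
    """Return the fraction of all prompts that survive each stage.

    Assumption: the judge is a strict funnel, so a later stage is
    credited only if every earlier stage passed for that prompt.
    """
    counts = {"compliance": 0, "attack_success": 0, "syntax": 0, "runtime": 0}
    n = len(verdicts)
    if n == 0:
        return {name: 0.0 for name in counts}
    for v in verdicts:
        stages = [v.complied, v.harmful, v.parses, v.runs]
        for name, passed in zip(counts, stages):
            if not passed:
                break  # stop at the first failed stage
            counts[name] += 1
    return {name: c / n for name, c in counts.items()}


example = [
    Verdict(True, True, True, True),
    Verdict(True, True, True, False),
    Verdict(True, False, True, True),   # fails stage (ii); later flags ignored
    Verdict(False, False, False, False),
]
rates = funnel_rates(example)
```

Under the strict-funnel assumption the rates are non-increasing from stage to stage, which matches the shape of the JAWS-0 figures reported above (61%, 58%, 52%, 27%). Whether the paper normalizes each stage by all prompts or by the survivors of the previous stage is not stated in the abstract.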