Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, so robustness estimates go stale and are difficult to compare across papers because datasets, harnesses, and judging protocols drift. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap with a multi-agent workflow that translates jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF has three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardized evaluation. Across 30 reproduced attacks, JBF achieves high fidelity, with a mean attack success rate (ASR) deviation (reproduced minus reported) of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to the original repositories and achieves an 82.5% mean reused-code ratio. The system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable way to build living benchmarks that keep pace with the rapidly shifting security landscape.
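The fidelity figure above is a mean signed deviation: for each attack, the reported ASR is subtracted from the reproduced ASR, and the differences are averaged. The following is a minimal sketch of that arithmetic, not JBF's own code. The function name, the attack names, and every ASR value are hypothetical and chosen only for illustration. Because signed differences can cancel, the sketch also computes the mean absolute deviation as a complementary view; that second quantity is our addition and is not a number stated in the abstract.

```python
from statistics import mean


def mean_signed_deviation(reported, reproduced):
    """Mean of (reproduced - reported) ASR, in percentage points.

    Both arguments map an attack name to an ASR given in percent.
    """
    if reported.keys() != reproduced.keys():
        raise ValueError("reported and reproduced must cover the same attacks")
    return mean(reproduced[name] - reported[name] for name in reported)


# Hypothetical ASR values in percent; these are NOT results from the paper.
reported = {"attack_a": 80.0, "attack_b": 55.0, "attack_c": 92.0}
reproduced = {"attack_a": 82.0, "attack_b": 53.5, "attack_c": 92.5}

signed = mean_signed_deviation(reported, reproduced)
# The absolute variant shows per-attack error that the signed mean can hide.
absolute = mean(abs(reproduced[name] - reported[name]) for name in reported)

print(f"mean signed deviation:   {signed:+.2f} pp")   # +0.33 pp
print(f"mean absolute deviation: {absolute:.2f} pp")  # 1.33 pp
```

In this toy data the signed mean (+0.33 pp) is much smaller than the absolute mean (1.33 pp), which is why a near-zero signed deviation is best read together with a per-attack breakdown.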