Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, so robustness estimates go stale and are difficult to compare across papers because datasets, harnesses, and judging protocols drift. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap with a multi-agent workflow that translates jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF has three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardized evaluation. Across 30 reproduced attacks, JBF achieves high fidelity, with a mean attack success rate (ASR) deviation (reproduced minus reported) of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to the original repositories and achieves an 82.5% mean reused-code ratio. The system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable way to build living benchmarks that keep pace with the rapidly shifting security landscape.
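The fidelity figure above is a signed mean of per-attack differences, so a minimal sketch of that arithmetic may help readers interpret it. The function name `mean_asr_deviation` and the sample values are assumptions made for illustration; they are not taken from the paper or from JBF's code.

```python
from statistics import mean


def mean_asr_deviation(pairs):
    """Mean signed deviation (reproduced - reported), in percentage points.

    `pairs` is an iterable of (reproduced_asr, reported_asr) tuples, with
    both values expressed in percent.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (reproduced, reported) pair")
    return mean(reproduced - reported for reproduced, reported in pairs)


# Hypothetical ASR values in percent, chosen only to show the arithmetic.
# They are NOT results from the paper.
example_pairs = [(90.0, 88.0), (75.0, 76.0), (60.0, 60.5)]
deviation = mean_asr_deviation(example_pairs)
print(f"{deviation:+.2f} pp")
```

Because the deviation is signed, positive and negative per-attack differences can offset each other; a mean absolute deviation computed over the same pairs would show the typical size of the gap.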