Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap with a multi-agent workflow that translates jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF has three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardized evaluation. Across 30 reproduced attacks, JBF achieves high fidelity, with a mean attack success rate (ASR) deviation (reproduced minus reported) of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to the original repositories and achieves an 82.5% mean reused-code ratio. The system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.
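The fidelity figure above is a signed mean, so it may help to see how such a number is computed. The following is a minimal sketch, assuming each reproduced attack contributes one (reproduced ASR, reported ASR) pair in percent; the function name `mean_asr_deviation` and the example values are hypothetical and are not taken from the paper or from JBF's code.

```python
from statistics import mean


def mean_asr_deviation(pairs):
    """Mean signed deviation (reproduced - reported), in percentage points.

    `pairs` is an iterable of (reproduced_asr, reported_asr) tuples,
    with both values expressed in percent.
    """
    deviations = [reproduced - reported for reproduced, reported in pairs]
    return mean(deviations)


# Hypothetical ASR values in percent, for illustration only (not from the paper).
example_pairs = [(92.0, 90.0), (48.5, 50.0), (71.0, 70.5)]

print(f"{mean_asr_deviation(example_pairs):+.2f} pp")  # prints "+0.33 pp"
```

Because the deviation is signed, overshoots and undershoots cancel in the mean. A value near zero therefore indicates little systematic bias across attacks, but does not by itself show that every individual attack was reproduced closely.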