Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-judge pipelines, but strong judge models remain expensive to run at scale. We study whether structured multi-agent debate can improve judge reliability while keeping backbone size and cost modest. To this end, we introduce HAJailBench, a human-annotated jailbreak benchmark of 11,100 labeled interactions spanning diverse attack methods and target models, and we pair it with a Multi-Agent Judge framework in which critic, defender, and judge agents debate under a shared safety rubric. On HAJailBench, the framework outperforms both prompting baselines on matched small backbones and prior multi-agent judges, while remaining cheaper than GPT-4o under the evaluated pricing snapshot. Ablations further show that a small number of debate rounds captures most of the gain. Together, these results support structured, value-aligned debate as a practical design for scalable LLM safety evaluation.