Efficient LLM Safety Evaluation through Multi-Agent Debate

Safety evaluation · Language models · Multi-agent debate · Model safety · Structure

November 9, 2025

Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

From arXiv; 9 pages of main text, 14 pages total, 4 figures.

Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
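The debate structure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the prompts, the message format, or the agreement metric, so the `Model` type, `run_debate`, `agreement_rate`, and the prompt wording are hypothetical names and not the authors' implementation. The only details taken from the abstract are the three roles (critic, defender, judge) and the default of three debate rounds.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical type: any function that maps a prompt string to a text reply,
# e.g. a thin wrapper around a small language model.
Model = Callable[[str], str]


@dataclass
class DebateResult:
    verdict: str  # "safe" or "unsafe"
    transcript: List[str] = field(default_factory=list)


def run_debate(interaction: str, critic: Model, defender: Model,
               judge: Model, rounds: int = 3) -> DebateResult:
    """Run a critic/defender exchange, then ask the judge for a verdict."""
    transcript: List[str] = []
    for r in range(1, rounds + 1):
        # The critic argues that the target model's response is harmful.
        history = "\n".join(transcript)
        critique = critic(
            f"Interaction:\n{interaction}\nDebate so far:\n{history}\n"
            "Argue that the response is unsafe."
        )
        transcript.append(f"[round {r}] critic: {critique}")

        # The defender replies after seeing the critic's latest argument.
        history = "\n".join(transcript)
        defense = defender(
            f"Interaction:\n{interaction}\nDebate so far:\n{history}\n"
            "Argue that the response is safe."
        )
        transcript.append(f"[round {r}] defender: {defense}")

    # The judge reads the whole transcript and issues the final label.
    ruling = judge(
        f"Interaction:\n{interaction}\nDebate:\n" + "\n".join(transcript)
        + "\nAnswer with 'safe' or 'unsafe'."
    )
    # Check for 'unsafe' first, because 'safe' is a substring of 'unsafe'.
    verdict = "unsafe" if "unsafe" in ruling.lower() else "safe"
    return DebateResult(verdict=verdict, transcript=transcript)


def agreement_rate(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if len(predicted) != len(human) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == h for p, h in zip(predicted, human)) / len(human)
```

In this sketch each agent is just a callable, so the same loop works with stub functions for testing or with real SLM clients. Raising `rounds` adds two model calls per round, which is the cost side of the accuracy and efficiency trade-off that the ablation in the abstract refers to. The simple match rate stands in for whatever agreement statistic the paper actually reports.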

