HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.

