In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs as well as attack and defense methods. Distinguished by its breadth, SALAD-Bench goes beyond conventional benchmarks through its large scale, rich diversity, intricate three-level taxonomy, and versatile functionalities. SALAD-Bench is built from a carefully curated array of questions, ranging from standard queries to complex ones enriched with attack and defense modifications, as well as multiple-choice questions. To manage this inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring its joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released at \url{https://github.com/OpenSafetyLab/SALAD-BENCH}. Warning: this paper includes examples that may be offensive or harmful.