In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs as well as attack and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate three-level taxonomy, and versatile functionality. SALAD-Bench is crafted with a meticulous array of questions, ranging from standard queries to complex ones enriched with attack and defense modifications, as well as multiple-choice questions. To effectively manage the inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for question-answer pairs, with a particular focus on attack-enhanced queries, ensuring seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring its joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released at https://github.com/OpenSafetyLab/SALAD-BENCH.