In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs as well as attack and defense methods. SALAD-Bench goes beyond conventional benchmarks through its large scale, rich diversity, intricate three-level taxonomy, and versatile functionality. It comprises a carefully constructed array of questions, from standard queries to complex ones enriched with attack modifications, defense modifications, and multiple-choice formats. To manage the inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released at https://github.com/OpenSafetyLab/SALAD-BENCH.