In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs, as well as attack and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate three-level taxonomy, and versatile functionalities. SALAD-Bench is crafted with a meticulous array of questions, ranging from standard queries to complex ones enriched with attack and defense modifications, as well as multiple-choice variants. To effectively manage this inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring its joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released at https://github.com/OpenSafetyLab/SALAD-BENCH.