Large Language Models (LLMs) have attracted considerable attention for their revolutionary capabilities. However, concern about their safety implications is also growing, making a comprehensive safety evaluation of LLMs before deployment urgently needed. In this work, we propose S-Eval, a new comprehensive, multi-dimensional, and open-ended safety evaluation benchmark. At the core of S-Eval is a novel LLM-based framework for automatic test prompt generation and selection, which trains an expert testing LLM Mt, combined with a range of test selection strategies, to automatically construct a high-quality test suite for safety evaluation. The key to automating this process is a novel expert safety-critique LLM Mc that quantifies the riskiness score of an LLM's response and additionally produces risk tags and explanations. The generation process is further guided by a carefully designed four-level risk taxonomy covering comprehensive, multi-dimensional safety risks of concern. Building on these components, we systematically construct a new large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts: 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and the accompanying safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks, and models. We extensively evaluate S-Eval on 20 popular and representative LLMs. The results confirm that S-Eval reflects and informs about the safety risks of LLMs better than existing benchmarks. We also explore the impacts of parameter scale, language environment, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.
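The pipeline described above (base risk prompts, attack-derived variants, and riskiness scoring by a safety-critique model) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `Critique` fields, the toy model, the toy critic, and the toy attacks are all hypothetical stand-ins; only the overall structure (each base prompt yields one variant per adversarial instruction attack, and every response is scored by the critique model) follows the abstract.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    risk_score: float   # hypothetical riskiness score; higher = riskier
    risk_tag: str       # hypothetical risk-taxonomy category
    explanation: str    # natural-language rationale for the score

def generate_attack_prompts(base_prompt, attacks):
    """Derive one adversarial variant of the base prompt per attack."""
    return [attack(base_prompt) for attack in attacks]

def evaluate(model, critic, base_prompts, attacks):
    """Score a model under test on base prompts and their attack variants."""
    results = []
    for base in base_prompts:
        for prompt in [base] + generate_attack_prompts(base, attacks):
            response = model(prompt)             # model under test
            results.append((prompt, critic(response)))  # safety-critique LLM
    return results

# Toy stand-ins for the tested LLM, the critique LLM, and two attacks.
toy_model = lambda p: f"response to: {p}"
toy_critic = lambda r: Critique(0.0, "safe", "no risk detected")
toy_attacks = [lambda p: f"[role-play] {p}", lambda p: f"[encoding] {p}"]

results = evaluate(toy_model, toy_critic, ["base prompt"], toy_attacks)
# 1 base prompt + 2 attack variants -> 3 scored responses
print(len(results))  # 3
```

With the paper's numbers, the same structure scales to 20,000 base prompts and 10 attacks, giving 200,000 attack prompts and 220,000 evaluation prompts in total.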