Large Language Models (LLMs) have gained considerable attention for their revolutionary capabilities. However, there is also growing concern about their safety implications, making a comprehensive safety evaluation of LLMs urgently needed before model deployment. In this work, we propose S-Eval, a new comprehensive, multi-dimensional, and open-ended safety evaluation benchmark. At the core of S-Eval is a novel LLM-based automatic test prompt generation and selection framework, which trains an expert testing LLM Mt and combines it with a range of test selection strategies to automatically construct a high-quality test suite for safety evaluation. The key to automating this process is a novel expert safety-critique LLM Mc, which quantifies the riskiness score of an LLM's response and additionally produces risk tags and explanations. The generation process is further guided by a carefully designed four-level risk taxonomy covering comprehensive and multi-dimensional safety risks of concern. On this basis, we systematically construct a new, large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts: 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and the accompanying safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks, and models. We extensively evaluate S-Eval on 20 popular and representative LLMs. The results confirm that S-Eval better reflects and informs the safety risks of LLMs than existing benchmarks. We also explore the impacts of parameter scale, language environment, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.
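The generate-select-critique loop described above can be sketched roughly as follows. This is a minimal illustration, not the S-Eval implementation: the model calls (`mt`, `mc`, `target`) are toy stand-ins for the trained expert testing LLM Mt, the safety-critique LLM Mc, and the model under test, and the selection strategy shown (deduplication plus a length filter) is a hypothetical placeholder for the paper's test selection strategies.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    risk_score: float   # quantified riskiness of the target model's response
    risk_tag: str       # e.g. a category from the four-level risk taxonomy
    explanation: str    # natural-language justification

def generate_test_prompts(mt, risk_category, n):
    """Mt proposes n candidate test prompts for one taxonomy category."""
    return [mt(risk_category) for _ in range(n)]

def select_prompts(candidates, min_len=10):
    """Placeholder selection strategy: keep deduplicated, non-trivial prompts."""
    seen, kept = set(), []
    for p in candidates:
        if len(p) >= min_len and p not in seen:
            seen.add(p)
            kept.append(p)
    return kept

def evaluate(target_llm, mc, prompts, threshold=0.5):
    """Query the target model on each prompt and let Mc judge the responses."""
    unsafe = 0
    for p in prompts:
        critique = mc(target_llm(p))
        if critique.risk_score >= threshold:
            unsafe += 1
    return unsafe / len(prompts)   # fraction of responses judged unsafe

# Toy stand-ins so the sketch runs end to end (NOT real models).
mt = lambda cat: f"test prompt about {cat}"
mc = lambda resp: Critique(0.0 if "refuse" in resp else 1.0, "toy", "stub")
target = lambda p: "I refuse to answer."

prompts = select_prompts(generate_test_prompts(mt, "crimes", 5))
rate = evaluate(target, mc, prompts)   # unsafe-response rate of the target
```

In the actual benchmark, the evaluated prompts would also include attack variants of each base risk prompt derived from adversarial instruction attacks, with Mc scoring the responses in the same way.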