With the rapid advancement of Large Language Models (LLMs), their safety has become a critical concern requiring precise assessment. Current benchmarks primarily focus on single-turn dialogues or a single jailbreak attack method when assessing safety. Moreover, these benchmarks do not examine in detail an LLM's ability to identify and handle unsafe information. To address these issues, we propose SafeDialBench, a fine-grained benchmark for evaluating the safety of LLMs against various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that covers 6 safety dimensions and generate more than 4,000 multi-turn dialogues in both Chinese and English across 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reversal, to improve the quality of the generated dialogues. Notably, we construct an innovative evaluation framework that measures an LLM's capabilities in detecting and handling unsafe information and in maintaining consistency under jailbreak attacks. Experimental results across 17 LLMs show that Yi-34B-Chat and GLM4-9B-Chat achieve superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.