Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine LLMs' safety mechanisms. However, the focus has been almost exclusively on English, and other languages remain largely unexplored. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples of risky prompt rejection. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation of the harmfulness of LLM responses. Our experiments on five LLMs show that region-specific risks are the most prevalent risk type and constitute a major issue for all of the Chinese LLMs we evaluated. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.