Safety Assessment of Chinese Large Language Models

With the rapid popularity of large language models such as ChatGPT and GPT-4, a growing amount of attention is paid to their safety concerns. These models may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes such as fraud and dissemination of misleading information. Evaluating and enhancing their safety is particularly essential for the wide application of large language models (LLMs). To further promote the safe deployment of LLMs, we develop a Chinese LLM safety assessment benchmark. Our benchmark explores the comprehensive safety performance of LLMs from two perspectives: 8 kinds of typical safety scenarios and 6 types of more challenging instruction attacks. Our benchmark is based on a straightforward process in which it provides the test prompts and evaluates the safety of the generated responses from the evaluated model. In evaluation, we utilize the LLM's strong evaluation ability and develop it as a safety evaluator by prompting. On top of this benchmark, we conduct safety assessments and analyze 15 LLMs including the OpenAI GPT series and other well-known Chinese LLMs, where we observe some interesting findings. For example, we find that instruction attacks are more likely to expose safety issues of all LLMs. Moreover, to promote the development and deployment of safe, responsible, and ethical AI, we publicly release SafetyPrompts including 100k augmented prompts and responses by LLMs.

翻译：随着ChatGPT和GPT-4等大语言模型的迅速普及，其安全性问题日益受到关注。这些模型可能生成侮辱性和歧视性内容，反映错误的社会价值观，并可能被用于欺诈和传播误导信息等恶意目的。评估和提升其安全性对于大语言模型（LLMs）的广泛应用至关重要。为进一步推动LLMs的安全部署，我们开发了一个中文LLM安全性评估基准。该基准从两个维度探索LLMs的综合安全性能：8种典型安全场景和6种更具挑战性的指令攻击类型。我们的基准基于一个简洁流程：提供测试提示，并评估被测试模型生成回复的安全性。在评估过程中，我们利用LLM强大的评估能力，通过提示将其发展为安全评估器。基于该基准，我们对包括OpenAI GPT系列及其他知名中文LLM在内的15个模型进行了安全性评估与分析，发现了一些有趣的结论。例如，我们发现指令攻击更容易暴露所有LLM的安全问题。此外，为促进安全、负责任且合乎道德的人工智能的开发与部署，我们公开发布了SafetyPrompts，其中包括10万条增强提示及LLM生成的回复。

相关内容

安全评估

关注 11

安全评估分狭义和广义二种。狭义指对一个具有特定功能的工作系统中固有的或潜在的危险及其严重程度所进行的分析与评估，并以既定指数、等级或概率值作出定量的表示，最后根据定量值的大小决定采取预防或防护对策。广义指利用系统工程原理和方法对拟建或已有工程、系统可能存在的危险性及其可能产生的后果进行综合评价和预测，并根据可能导致的事故风险的大小，提出相应的安全对策措施，以达到工程、系统安全的过程。安全评估又称风险评估、危险评估，或称安全评价、风险评价和危险评价。

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日