Large Language Models (LLMs) show significant strengths in text understanding and generation. However, LLMs risk generating harmful content, especially when deployed in applications. Several black-box attack methods, such as Prompt Attack, can change the behaviour of LLMs and induce them to generate unexpected answers containing harmful content. Although researchers are interested in prompt attack and defense for LLMs, there is no publicly available dataset for evaluating the ability to defend against prompt attacks. In this paper, we introduce a Chinese Prompt Attack Dataset for LLMs, called CPAD. Our prompts aim to induce LLMs to generate unexpected outputs, using several carefully designed prompt attack approaches and attack content of wide concern. Unlike previous datasets for safety evaluation, we construct the prompts along three dimensions: contents, attacking methods and goals, so that the responses can be easily evaluated and analysed. We run several well-known Chinese LLMs on our dataset, and the results show that our prompts pose a significant threat to these LLMs, with an attack success rate of around 70%. We will release CPAD to encourage further studies on prompt attack and defense.
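To make the evaluation idea concrete, the following is a minimal sketch of how an attack success rate could be computed over prompts labelled along the three dimensions. It is not CPAD's actual schema or evaluation code: the `AttackRecord` class, its field names, and both helper functions are hypothetical, and the sketch assumes each response has already been judged as a success or failure against the prompt's goal.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackRecord:
    # Hypothetical fields mirroring the three construction dimensions.
    content: str     # topic of the attack content
    method: str      # prompt attack approach used
    goal: str        # output the attack tries to elicit
    succeeded: bool  # assumed judgement: did the response fulfil the goal?


def attack_success_rate(records):
    """Return the fraction of records whose attack was judged successful."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(r.succeeded for r in records) / len(records)


def success_rate_by(records, dimension):
    """Group records by one dimension ('content', 'method' or 'goal')
    and return the attack success rate of each group."""
    groups = defaultdict(list)
    for r in records:
        groups[getattr(r, dimension)].append(r)
    return {key: attack_success_rate(group) for key, group in groups.items()}
```

Under these assumptions, `attack_success_rate(records)` gives the overall rate for one model, and `success_rate_by(records, "method")` breaks it down by attack approach, which is the kind of per-dimension analysis the three-dimension construction is meant to support.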