Large Language Models (LLMs) show considerable strength in text understanding and generation. However, LLMs risk generating harmful content, especially when deployed in applications. Several black-box attack methods, such as prompt attacks, can change the behaviour of LLMs and induce them to generate unexpected answers with harmful content. Researchers are interested in prompt attack and defense for LLMs, yet no publicly available dataset with a high attack success rate exists for evaluating the ability to defend against prompt attacks. In this paper, we introduce a pipeline for constructing high-quality prompt attack samples, along with a Chinese prompt attack dataset called CPAD. Our prompts aim to induce LLMs to generate unexpected outputs using several carefully designed prompt attack templates and attack content of wide concern. Unlike previous datasets for safety evaluation, we construct the prompts along three dimensions: content, attack method, and goal. In particular, the attack goal specifies the behaviour expected after a successful attack on the LLM, so the responses can be easily evaluated and analysed. We run several popular Chinese LLMs on our dataset, and the results show that our prompts pose a significant threat to LLMs, with an attack success rate of around 70% against GPT-3.5. CPAD is publicly available at https://github.com/liuchengyuan123/CPAD.
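The abstract describes each prompt by three dimensions (content, attack method, goal) and reports an attack success rate. As a minimal sketch of how such a rate could be computed overall and per dimension, assuming each response has already been judged against its goal: the `JudgedSample` schema, its field names, and both helper functions are hypothetical illustrations and are not taken from the CPAD repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedSample:
    """One attack prompt plus a verdict on the model's response (hypothetical schema)."""
    content: str     # topic the prompt targets
    method: str      # attack template used to build the prompt
    goal: str        # behaviour the attack tries to elicit
    succeeded: bool  # True if the response exhibited the goal behaviour


def attack_success_rate(samples):
    """Fraction of samples whose attack succeeded; 0.0 for an empty list."""
    if not samples:
        return 0.0
    return sum(s.succeeded for s in samples) / len(samples)


def success_rate_by(samples, dimension):
    """Attack success rate grouped by 'content', 'method' or 'goal'."""
    groups = defaultdict(list)
    for s in samples:
        groups[getattr(s, dimension)].append(s)
    return {key: attack_success_rate(group) for key, group in groups.items()}
```

Grouping by `method` or `goal` in this way would show which templates or target behaviours a given model resists least, which is the kind of analysis the explicit goal dimension is meant to make easy.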