In recent years, large language models (LLMs) have demonstrated notable success across a wide range of tasks, but their trustworthiness remains an open problem. One specific threat is that they can generate toxic or harmful responses: attackers can craft adversarial prompts that induce such responses. In this work, we pioneer a theoretical foundation for LLM security by identifying bias vulnerabilities in safety fine-tuning, and we design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA on a range of open-source and closed-source models and show state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA achieves a 90% attack success rate against the GPT-4 chatbot.