Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also pose risks during daily interactions. This paper introduces a new perspective on jailbreaking: treating LLMs as human-like communicators, in order to explore the overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs into being jailbroken. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over $92\%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs.