There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR, which is inspired by social engineering attacks, uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open- and closed-source LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.
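The query-and-refine loop described above can be sketched in a few lines. This is a minimal illustration of the control flow only, not the authors' implementation. Every name in it (`iterative_refinement`, `RefinementResult`, the `attacker`, `target`, and `judge` callables) is hypothetical. The abstract does not say how success is decided, so the `judge` callable is an assumption, and the default budget of 20 queries simply mirrors the "fewer than twenty queries" figure. The toy stand-ins at the bottom are deterministic placeholders so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical record of one round: (candidate prompt, target response, success flag).
History = List[Tuple[str, str, bool]]


@dataclass
class RefinementResult:
    success: bool
    queries_used: int
    final_prompt: Optional[str]


def iterative_refinement(
    attacker: Callable[[str, History], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],  # assumption: not specified in the abstract
    objective: str,
    max_queries: int = 20,
) -> RefinementResult:
    """Refine a candidate prompt using only text-in, text-out access to the target."""
    history: History = []
    for query_count in range(1, max_queries + 1):
        # The attacker proposes a new candidate, conditioned on all earlier rounds.
        candidate = attacker(objective, history)
        # Black-box access: one query to the target, only its response is observed.
        response = target(candidate)
        success = judge(objective, response)
        history.append((candidate, response, success))
        if success:
            return RefinementResult(True, query_count, candidate)
    # Query budget exhausted without a successful candidate.
    return RefinementResult(False, max_queries, None)


# Toy stand-ins (no real models): the attacker appends one marker per past round,
# and the target changes its answer once the prompt carries three markers.
def toy_attacker(objective: str, history: History) -> str:
    return objective + " !" * len(history)


def toy_target(prompt: str) -> str:
    return "COMPLY" if prompt.count("!") >= 3 else "REFUSE"


def toy_judge(objective: str, response: str) -> bool:
    return response == "COMPLY"


result = iterative_refinement(toy_attacker, toy_target, toy_judge, "placeholder objective")
print(result)
```

Passing the three roles as plain callables keeps the loop independent of any particular model API. Counting each call to `target` as one query matches the query-efficiency framing in the text.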