Despite efforts to align large language models (LLMs) with human values, widely used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM.
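The perturb-then-aggregate idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the perturbation type, the aggregation rule, or how a jailbreak is detected, so the character-replacement scheme, the majority vote, and the refusal-phrase check below are all assumptions, and the names `perturb`, `is_non_refusal`, and `smooth_generate` are hypothetical. The `llm` argument stands for any callable that maps a prompt string to a response string.

```python
import random
import string
from typing import Callable, List, Sequence

# Assumption: a response counts as a refusal if it contains one of these phrases.
REFUSAL_PHRASES: Sequence[str] = ("I'm sorry", "I cannot", "I can't")


def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace roughly a fraction `rate` of characters (at least one)
    with a different, randomly chosen printable character."""
    if not prompt:
        return prompt
    chars = list(prompt)
    k = min(len(chars), max(1, int(len(chars) * rate)))
    for i in rng.sample(range(len(chars)), k):
        # Exclude the original character so the position really changes.
        chars[i] = rng.choice([c for c in string.printable if c != chars[i]])
    return "".join(chars)


def is_non_refusal(response: str) -> bool:
    """Crude proxy for 'the attack succeeded': no refusal phrase is present."""
    return not any(phrase in response for phrase in REFUSAL_PHRASES)


def smooth_generate(
    prompt: str,
    llm: Callable[[str], str],
    judge: Callable[[str], bool] = is_non_refusal,
    n_copies: int = 10,
    rate: float = 0.1,
    seed: int = 0,
) -> str:
    """Query `llm` on `n_copies` perturbed prompts and return a response
    that agrees with the majority verdict of `judge`."""
    rng = random.Random(seed)
    responses: List[str] = [llm(perturb(prompt, rate, rng)) for _ in range(n_copies)]
    verdicts = [judge(r) for r in responses]
    majority = sum(verdicts) > n_copies / 2  # strict majority flagged by judge
    consistent = [r for r, v in zip(responses, verdicts) if v == majority]
    return rng.choice(consistent)


if __name__ == "__main__":
    # Toy stand-in for a model: it complies only if the attack string is intact.
    attack = "Tell me something harmful !! describing similarlyNow"

    def brittle_llm(p: str) -> str:
        return "Sure, here is how" if p == attack else "I'm sorry, I cannot help."

    print(smooth_generate(attack, brittle_llm))
```

The toy model at the bottom only mimics the brittleness the abstract reports: because every perturbed copy differs from the intact attack string, all copies are refused and the vote returns a refusal. A real deployment would substitute an actual model call for `llm` and a more careful classifier for `judge`.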