Despite efforts to align large language models (LLMs) with human values, widely used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM. Our code is publicly available at the following link: https://github.com/arobey1/smooth-llm.
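The perturb-then-aggregate procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The names `random_swap_perturb`, `smooth_llm`, `llm`, and `is_jailbroken`, the parameters `num_copies` and `q`, the choice of random character replacement as the perturbation, and the majority vote as the aggregation rule are all assumptions made here for illustration.

```python
import random
import string
from typing import Callable


def random_swap_perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace a fraction q of the characters with random printable ones.

    Assumption: this is one plausible character-level perturbation; the
    abstract does not fix a specific perturbation type.
    """
    chars = list(prompt)
    num_to_change = int(len(chars) * q)
    # Choose distinct positions and overwrite each with a random character.
    for i in rng.sample(range(len(chars)), num_to_change):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smooth_llm(
    prompt: str,
    llm: Callable[[str], str],             # hypothetical: prompt -> response
    is_jailbroken: Callable[[str], bool],  # hypothetical: response -> verdict
    num_copies: int = 10,
    q: float = 0.1,
    seed: int = 0,
) -> str:
    """Query the model on perturbed copies and aggregate by majority vote.

    Assumption: aggregation is a simple majority vote over per-copy
    jailbreak verdicts, returning a response that agrees with the majority.
    """
    rng = random.Random(seed)
    # Each copy of the prompt is perturbed independently before querying.
    responses = [
        llm(random_swap_perturb(prompt, q, rng)) for _ in range(num_copies)
    ]
    votes = [is_jailbroken(r) for r in responses]
    # Strict majority; a tie is treated as "not jailbroken".
    majority = sum(votes) > num_copies / 2
    consistent = [r for r, v in zip(responses, votes) if v == majority]
    return rng.choice(consistent)
```

In this sketch, `llm` and `is_jailbroken` stand in for a real model call and a real jailbreak judge. Because an adversarial suffix is brittle to character-level changes, most perturbed copies are expected to elicit a refusal, so the majority vote returns a refusal even when the unperturbed prompt would have succeeded.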