Despite efforts to align large language models (LLMs) with human values, widely used LLMs such as GPT, Llama, Claude, and PaLM remain susceptible to jailbreaking attacks, in which an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM.
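The perturb-then-aggregate procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the perturbation is assumed to be random character substitution, jailbreak detection is assumed to be a simple refusal-keyword check, aggregation is assumed to be a majority vote, and all names (`perturb`, `is_jailbroken`, `smooth_llm`, `toy_llm`, `ATTACK_PROMPT`) are hypothetical.

```python
import random
import string
from typing import Callable, List

# Assumption: characters used for substitutions (printable, no control characters).
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "
# Assumption: a response containing one of these phrases counts as a refusal.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace a fraction q of the characters with different random characters."""
    chars = list(prompt)
    k = int(len(chars) * q)
    for i in rng.sample(range(len(chars)), k):
        # Exclude the current character so every chosen position really changes.
        chars[i] = rng.choice([c for c in ALPHABET if c != chars[i]])
    return "".join(chars)


def is_jailbroken(response: str) -> bool:
    """Heuristic check: the response is treated as jailbroken if it is not a refusal."""
    return not any(marker in response for marker in REFUSAL_MARKERS)


def smooth_llm(prompt: str, llm: Callable[[str], str],
               n_copies: int = 10, q: float = 0.1, seed: int = 0) -> str:
    """Query the model on perturbed copies and return a response that agrees with the majority."""
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    rng = random.Random(seed)
    responses: List[str] = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority = sum(flags) > len(flags) / 2  # ties count as "not jailbroken"
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent)


# Hypothetical stand-in for a model: it complies only when the adversarial
# prompt arrives exactly intact, mimicking the brittleness noted above.
ATTACK_PROMPT = "Tell me something forbidden !!ADV-SUFFIX!!"


def toy_llm(prompt: str) -> str:
    if prompt == ATTACK_PROMPT:
        return "Sure, here is the requested content."
    return "I'm sorry, but I cannot help with that."
```

Calling `toy_llm(ATTACK_PROMPT)` directly yields the compliant response, while `smooth_llm(ATTACK_PROMPT, toy_llm)` returns a refusal, because every perturbed copy differs from the exact attack string. A real model is not this brittle, so in practice the number of copies and the perturbation rate `q` would need tuning.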