As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of proliferation model and number of proliferated examples play an key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.
翻译:随着大型语言模型(LLMs)能力日益增强,确保其安全性以防止滥用变得至关重要。尽管研究者们致力于开发鲁棒的防御方法,但目前尚未有任何技术能够完全抵御所有攻击。我们提出了一种替代方案:不追求完美的对抗鲁棒性,而是开发快速响应技术,在仅观察到少量攻击样本后,即可阻断整类越狱攻击。为研究这一场景,我们构建了RapidResponseBench基准测试,用于评估防御系统在适应少量观察样本后对各种越狱策略的鲁棒性。我们评估了五种快速响应方法,这些方法均采用越狱扩散技术——即根据观察到的示例自动生成类似的额外越狱攻击。我们最强的方法通过微调输入分类器来阻断扩散生成的越狱攻击,在仅观察到每种越狱策略的一个示例的情况下,将分布内越狱攻击的成功率降低了240倍以上,对分布外越狱攻击的成功率降低了15倍以上。进一步研究表明,扩散模型的质量与生成示例的数量对该防御效果具有关键影响。总体而言,我们的研究结果凸显了通过快速响应新型越狱攻击来限制LLM滥用的潜力。