Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Recently, Anil et al. (2024) showed that many-shot (up to hundreds of) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability. Nevertheless, is it possible to use few-shot demonstrations to efficiently jailbreak LLMs within limited context sizes? While vanilla few-shot jailbreaking may be inefficient, we propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool. These simple techniques result in surprisingly effective jailbreaking against aligned LLMs (even with advanced defenses). For example, our method achieves >80% (mostly >95%) attack success rates (ASRs) on Llama-2-7B and Llama-3-8B without multiple restarts, even if the models are enhanced by strong defenses such as perplexity detection and/or SmoothLLM, a setting that is challenging for suffix-based jailbreaking. In addition, we conduct comprehensive and elaborate (e.g., making sure to use correct system prompts) evaluations against other aligned LLMs and advanced defenses, where our method consistently achieves nearly 100% ASRs. Our code is available at https://github.com/sail-sg/I-FSJ.

