Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.