We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM that generates new suffixes from existing attacks on harmful questions in the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches, evaluated against 15 recent methods, achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and a 99.4% ASR on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open- and closed-source models.