Large language models (LLMs) are designed to align their responses with human values. This study exploits LLMs with an iterative prompting technique in which each prompt is systematically modified and refined across multiple iterations to progressively increase its effectiveness in jailbreaking attacks. The technique analyzes the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM, allowing us to adjust and optimize prompts to evade the models' ethical and security constraints. Persuasion strategies further enhance prompt effectiveness while preserving consistency with the malicious intent. Our results show that the attack success rate (ASR) increases as the attacking prompts become more refined, reaching a highest ASR of 90% for GPT-4 and ChatGLM and a lowest ASR of 68% for LLaMa2. Our technique outperforms the baseline techniques (PAIR and PAP) in ASR and shows comparable performance with GCG and ArtPrompt.
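To make the iterative procedure concrete, below is a minimal sketch of the refine-query-evaluate loop summarized above. The helper names (query_target, is_refusal, refine_prompt) are hypothetical placeholders introduced here for illustration, not the paper's implementation; the actual refinement heuristics and persuasion strategies are left abstract.

```python
# Minimal sketch of the iterative prompt-refinement loop described in the
# abstract. All helpers are hypothetical placeholders; no concrete attack
# content or model API is assumed.

def query_target(prompt: str) -> str:
    """Placeholder for a call to the target LLM (e.g., via its API)."""
    raise NotImplementedError

def is_refusal(response: str) -> bool:
    """Crude refusal check; real evaluations typically use a judge model."""
    markers = ("I can't", "I cannot", "I'm sorry")
    return any(m in response for m in markers)

def refine_prompt(prompt: str, response: str) -> str:
    """Placeholder: adjust the prompt based on the observed response pattern,
    e.g., by applying a persuasion strategy while keeping the original intent."""
    raise NotImplementedError

def iterative_attack(seed_prompt: str, max_iters: int = 10) -> str | None:
    """Refine the prompt across iterations until the target complies
    or the iteration budget is exhausted."""
    prompt = seed_prompt
    for _ in range(max_iters):
        response = query_target(prompt)
        if not is_refusal(response):
            return prompt  # success: this trial counts toward the ASR
        prompt = refine_prompt(prompt, response)
    return None  # failure within the iteration budget
```

Under this framing, the reported ASR is simply the fraction of seed prompts for which the loop returns a successful prompt within the budget.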