Large Language Models (LLMs) like ChatGPT face "jailbreak" challenges, in which safeguards are bypassed to elicit ethically harmful content. This study introduces a simple black-box method for generating jailbreak prompts that avoids the high complexity and computational cost of existing methods. The technique iteratively rewrites harmful prompts into non-harmful expressions using the target LLM itself, based on the hypothesis that LLMs can directly sample safeguard-bypassing expressions. In experiments with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, the method achieved an attack success rate of over 80% within an average of 5 iterations and remained effective despite model updates. The generated jailbreak prompts were naturally worded and concise, suggesting they are harder to detect. The results indicate that creating effective jailbreak prompts is simpler than previously thought, and that black-box jailbreak attacks pose a more serious security threat.