To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and an analytical approach, and we introduce a new black-box jailbreak attack methodology named IntentObfuscator, which exploits this identified flaw by obfuscating the true intent behind user prompts. This approach leads LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework, "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent detection. We empirically validate IntentObfuscator across several models, including ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, achieving an average jailbreak success rate of 69.21\%. Notably, on ChatGPT-3.5, which claims 100 million weekly active users, the success rate reached 83.65\%. We also extend our validation to diverse types of sensitive content, including graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further demonstrating the relevance of our findings to enhancing 'Red Team' strategies against LLM content security frameworks.