To demonstrate and address this underlying vulnerability, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, which exploits the identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to effectively evade malicious-intent detection. We empirically validate the effectiveness of IntentObfuscator across several models, including ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, achieving an average jailbreak success rate of 69.21\%. Notably, our tests on ChatGPT-3.5, which reportedly serves 100 million weekly active users, achieved a success rate of 83.65\%. We also extend our validation to diverse categories of sensitive content, including graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further demonstrating the substantial impact of our findings on enhancing "Red Team" strategies against LLM content security frameworks.
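To make the evaluation pipeline concrete, the sketch below shows one plausible way a black-box harness for this kind of study could be organized: apply an obfuscation strategy to each test prompt, query the target model, and report the per-strategy success rate (the fraction of responses that are not refusals). This is a minimal illustration under stated assumptions, not the authors' implementation; all names (ObfuscationStrategy, query_model, is_refusal) are hypothetical, the transforms are identity stubs standing in for "Obscure Intention" and "Create Ambiguity", and the refusal check is a crude keyword heuristic where real evaluations would use human review or a classifier.

\begin{verbatim}
# Hypothetical red-team evaluation harness sketch (not the paper's code).
from dataclasses import dataclass
from typing import Callable, List, Dict

@dataclass
class ObfuscationStrategy:
    # A strategy rewrites a prompt to mask its intent; both transforms
    # here are placeholder stubs, not actual obfuscation templates.
    name: str
    transform: Callable[[str], str]

def is_refusal(response: str) -> bool:
    # Crude refusal heuristic (assumed); real studies typically rely on
    # manual labeling or a trained judge model.
    markers = ("I can't", "I cannot", "I'm sorry", "against my guidelines")
    return any(m in response for m in markers)

def evaluate(strategies: List[ObfuscationStrategy],
             prompts: List[str],
             query_model: Callable[[str], str]) -> Dict[str, float]:
    # Per-strategy jailbreak success rate: fraction of obfuscated prompts
    # the black-box target answers rather than refuses.
    rates = {}
    for s in strategies:
        successes = sum(
            not is_refusal(query_model(s.transform(p))) for p in prompts
        )
        rates[s.name] = successes / len(prompts)
    return rates

strategies = [
    ObfuscationStrategy("Obscure Intention", lambda p: p),  # stub
    ObfuscationStrategy("Create Ambiguity", lambda p: p),   # stub
]
\end{verbatim}

Averaging the per-model, per-strategy rates produced by such a harness is how an aggregate figure like the 69.21\% reported above would be obtained.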