While large language models (LLMs) have grown increasingly capable, they have also given rise to a wide range of harmful behaviors. A representative example is the jailbreak attack, which can provoke harmful or unethical responses from LLMs even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Unlike previous jailbreak attacks, our proposed cognitive overload is a black-box attack that requires neither knowledge of the model architecture nor access to the model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by work in cognitive psychology on managing cognitive load, we further investigate defenses against cognitive overload attacks from two perspectives. Empirical studies show that cognitive overload, applied from these three perspectives, successfully jailbreaks all studied LLMs, while existing defense strategies can hardly mitigate the resulting malicious uses.