While large language models (LLMs) have grown increasingly powerful, they have also given rise to a wide range of harmful behaviors. A representative example is the jailbreak attack, which can provoke harmful or unethical responses from LLMs even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Unlike previous jailbreak attacks, our proposed cognitive overload is a black-box attack that requires neither knowledge of the model architecture nor access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending against cognitive overload attacks from two perspectives. Empirical studies show that cognitive overload, from all three perspectives, successfully jailbreaks every LLM studied, while existing defense strategies can hardly mitigate the resulting malicious use.