Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

While large language models (LLMs) have grown increasingly capable, they have also given rise to a wide range of harmful behaviors. A representative example is the jailbreak attack, which can provoke harmful or unethical responses from LLMs even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks that target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Unlike previous jailbreak attacks, our proposed cognitive overload is a black-box attack that requires neither knowledge of the model architecture nor access to model weights. Experiments on AdvBench and MasterKey reveal that various LLMs, including both the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending against cognitive overload attacks from two perspectives. Empirical studies show that cognitive overload from all three perspectives can successfully jailbreak all studied LLMs, while existing defense strategies can hardly mitigate the resulting malicious uses.

