GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming. In this study, we discover that chatting in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework, CipherChat, to systematically examine the generalizability of safety alignment to non-natural languages, namely ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4, on different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers bypass the safety alignment of GPT-4 almost 100% of the time in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a "secret cipher", and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. Surprisingly, SelfCipher outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.
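To make concrete what "enciphered" text means here, the following is a minimal sketch of one classic human cipher, the Caesar shift, applied to a benign sentence. The abstract does not name the specific ciphers evaluated, so the choice of cipher, the shift of 3, and the function names `caesar_shift`, `encipher`, and `decipher` are illustrative assumptions, not the CipherChat implementation.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions, wrapping around the alphabet.

    Non-letter characters (spaces, punctuation, digits) pass through unchanged.
    Illustrative only: not taken from the CipherChat code base.
    """
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def encipher(text: str, shift: int = 3) -> str:
    """Encode plain text (the shift of 3 is an assumed default)."""
    return caesar_shift(text, shift)


def decipher(text: str, shift: int = 3) -> str:
    """Invert `encipher` by shifting in the opposite direction."""
    return caesar_shift(text, -shift)


if __name__ == "__main__":
    plain = "Hello, world!"
    coded = encipher(plain)
    print(coded)            # Khoor, zruog!
    print(decipher(coded))  # Hello, world!
```

The enciphered string shares no surface tokens with the original sentence, which gives a sense of why alignment carried out mainly on natural-language text may not generalize to such inputs.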

