GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.

翻译：安全是大语言模型（LLMs）发展的核心。已有大量工作致力于将LLMs与人类伦理和偏好对齐，包括预训练中的数据过滤、监督微调、基于人类反馈的强化学习以及红队测试等。本研究发现，以密码形式进行的对话能够规避主要基于自然语言的安全对齐技术。我们提出新颖的框架CipherChat，系统性地检验安全对齐向非自然语言（即密码）的泛化能力。CipherChat通过结合系统角色描述和少量加密示例的密码提示，使人类能够与LLMs进行对话。我们利用CipherChat评估了包括ChatGPT和GPT-4在内的最先进LLMs，在11个安全领域的英文和中文场景中，针对不同代表性人类密码进行测试。实验结果表明，在某些安全领域，特定密码几乎100%成功绕过GPT-4的安全对齐，凸显了为非自然语言开发安全对齐的必要性。值得注意的是，我们发现LLMs似乎存在一种"秘密密码"，并据此提出新颖的SelfCipher方法——仅通过角色扮演和少量自然语言示例即可激发此能力。SelfCipher在几乎所有情况下均显著优于现有的人类密码。我们的代码和数据将发布于https://github.com/RobustNLP/CipherChat。

相关内容

GPT-4

关注 29

北京时间2023年3月15日凌晨，ChatGPT开发商OpenAI 发布了发布了全新的多模态预训练大模型 GPT-4，可以更可靠、更具创造力、能处理更细节的指令，根据图片和文字提示都能生成相应内容。具体来说来说，GPT-4 相比上一代的模型，实现了飞跃式提升：支持图像和文本输入，拥有强大的识图能力；大幅提升了文字输入限制，在ChatGPT模式下，GPT-4可以处理超过2.5万字的文本，可以处理一些更加细节的指令；回答准确性也得到了显著提高。

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日