Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.
翻译:安全性位于大型语言模型(LLM)发展的核心。当前已有大量研究致力于使LLM与人类伦理和偏好对齐,包括预训练中的数据过滤、监督微调、基于人类反馈的强化学习以及红队测试等。本研究发现,通过密码进行对话能够绕过主要基于自然语言的安全对齐技术。我们提出新型框架CipherChat,系统性地检验安全对齐对非自然语言(即密码)的泛化能力。CipherChat允许人类通过包含系统角色描述和少量密码示例的提示词与LLM进行加密对话。我们利用CipherChat评估了包括ChatGPT和GPT-4在内的先进LLM,在英语和中文的11个安全领域中测试了多种代表性人类密码。实验结果表明,在若干安全领域内,特定密码几乎100%地成功绕过了GPT-4的安全对齐机制,这凸显了为非自然语言开发安全对齐技术的必要性。值得注意的是,我们发现LLM似乎存在“秘密密码”,并据此提出新型SelfCipher方法——仅通过角色扮演和若干自然语言示例即可唤起该能力。令人惊讶的是,SelfCipher在几乎所有场景中均优于现有的人类密码。相关代码和数据集将发布于https://github.com/RobustNLP/CipherChat。