Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming. In this study, we discover that chatting in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework, CipherChat, to systematically examine the generalizability of safety alignment to non-natural languages, namely ciphers. CipherChat enables humans to chat with LLMs through cipher prompts that combine system role descriptions with few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4, on different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time in bypassing the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a "secret cipher", and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.