Large Language Models (LLMs) are aligned to moral and ethical guidelines but remain susceptible to creative prompts, known as jailbreaks, that can bypass the alignment process. However, most jailbreaking prompts state the harmful question in natural language (mainly English), which the LLMs themselves can detect. In this paper, we present jailbreaking prompts encoded using cryptographic techniques. We first present a pilot study in which the state-of-the-art LLM, GPT-4, decodes several safe sentences that have been encrypted using various cryptographic techniques, and find that a straightforward word substitution cipher is decoded most effectively. Motivated by this result, we use this encoding technique to write jailbreaking prompts. We present a mapping from unsafe words to safe words and ask the unsafe question using the mapped words. Experimental results show an attack success rate of up to 59.42% for our proposed jailbreaking approach on state-of-the-art proprietary models, including ChatGPT, GPT-4, and Gemini-Pro. Additionally, we discuss the over-defensiveness of these models. We believe that our work will encourage further research into making these LLMs more robust while maintaining their decoding capabilities.
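To illustrate the kind of encoding examined in the pilot study, the following is a minimal sketch of a word substitution cipher applied to a safe sentence. The function names, the example mapping, and the sentence are all hypothetical and chosen only for illustration; the abstract does not specify the actual mappings or prompts used in the experiments.

```python
import re


def substitute_words(sentence, mapping):
    """Replace every whole word that appears in `mapping` with its substitute.

    Words not present in the mapping are left unchanged. Matching is
    case-insensitive on lookup. (Hypothetical helper, not from the paper.)
    """
    def swap(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, sentence)


def invert(mapping):
    """Build the decoding table; assumes the mapping is one-to-one."""
    return {substitute: original for original, substitute in mapping.items()}


# Illustrative mapping over harmless words only (an assumption for this sketch).
mapping = {"cat": "river", "sat": "painted", "mat": "cloud"}

plain = "the cat sat on the mat"
encoded = substitute_words(plain, mapping)
decoded = substitute_words(encoded, invert(mapping))

print(encoded)  # the river painted on the cloud
print(decoded)  # the cat sat on the mat
```

Because each word is swapped for another ordinary word, the encoded sentence is still fluent text, and decoding needs only the mapping table. This may be one reason such a cipher is easier for a model to decode than character-level schemes, though the abstract itself does not state the cause.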