The rapid evolution of Large Language Models (LLMs) has rendered them indispensable in modern society. While security measures are typically applied to align LLMs with human values prior to release, recent studies have unveiled a concerning phenomenon named "jailbreak": unexpected and potentially harmful responses generated by LLMs when prompted with malicious questions. Most existing research focuses on generating jailbreak prompts, but system message configurations vary significantly across experiments. In this paper, we aim to answer a question: Is the system message really important for jailbreaks in LLMs? We conduct experiments on mainstream LLMs to generate jailbreak prompts under varying system messages: short, long, and none. We find that different system messages offer distinct resistance to jailbreaks. We therefore explore the transferability of jailbreaks across LLMs with different system messages. Furthermore, we propose the System Messages Evolutionary Algorithm (SMEA) to generate system messages that are more resistant to jailbreak prompts, even under minor changes. Through SMEA, we obtain a robust population of system messages with little change in their length. Our research not only bolsters LLM security but also raises the bar for jailbreaks, fostering advancements in this field of study.