GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

The discovery of "jailbreaks" that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks before release. Such testing requires a method that can generate jailbreaks at scale and efficiently. In this paper, we follow a novel yet intuitive strategy: generating jailbreaks in the style in which humans write them. We propose a role-playing system that assigns four different roles to user LLMs, which collaborate to produce new jailbreaks. Furthermore, we collect existing jailbreaks and split them, sentence by sentence, into independent characteristics based on clustering frequency and semantic patterns. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. The role-playing system leverages this knowledge graph to generate new jailbreaks, which prove effective in inducing LLMs to generate unethical or guideline-violating responses. In addition, we pioneer a setting in which the system automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs adhere to those guidelines. We refer to our system as GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We empirically validate the effectiveness of GUARD on three cutting-edge open-source LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely used commercial LLM (ChatGPT). Moreover, our work extends to vision-language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD's versatility and contributing insights for the development of safer, more reliable LLM-based applications across diverse modalities.
