Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Considerable research effort has been devoted to ensuring that large language models (LLMs) align with human values and generate safe text. However, an excessive focus on sensitivity to certain topics can compromise a model's robustness in following instructions, thereby degrading its overall performance in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily evaluated the safety of the models without considering their robustness. In this paper, we propose a benchmark that assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach. To comprehensively study text safety and output robustness, we introduce a latent jailbreak prompt dataset in which each prompt embeds a malicious instruction. Specifically, we instruct the model to complete a regular task, such as translation, where the text to be translated contains a malicious instruction. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs with respect to the position of explicit normal instructions, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). Our results demonstrate that current LLMs not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
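
To make the prompt construction concrete, the following is a minimal sketch of how an explicit normal instruction could be combined with an embedded payload while varying instruction position, verb, cue word, and target group. It is not taken from the released code: the function names, the template wording, and every value in the lists are hypothetical placeholders, and the payload is deliberately left as an inert marker string.

```python
from itertools import product

# Illustrative placeholders only; these are assumptions, not the benchmark's templates.
VERBS = ["Translate", "Convert"]
CUES = ["sentence", "text"]
GROUPS = ["<group A>", "<group B>"]
POSITIONS = ["prefix", "suffix"]


def build_prompt(verb, cue, payload, position):
    """Wrap an embedded payload in an explicit normal instruction."""
    labelled = f"{cue.capitalize()}: {payload}"
    if position == "prefix":
        # The explicit normal instruction precedes the text to be processed.
        return f"{verb} the following {cue} into Chinese.\n{labelled}"
    if position == "suffix":
        # The explicit normal instruction follows the text to be processed.
        return f"{labelled}\n{verb} the above {cue} into Chinese."
    raise ValueError(f"unknown position: {position}")


def generate_prompts():
    """Enumerate every combination of position, verb, cue word, and target group."""
    prompts = []
    for position, verb, cue, group in product(POSITIONS, VERBS, CUES, GROUPS):
        # Inert stand-in for the malicious instruction embedded in the text.
        payload = f"[MALICIOUS INSTRUCTION targeting {group}]"
        prompts.append({
            "position": position,
            "verb": verb,
            "cue": cue,
            "group": group,
            "prompt": build_prompt(verb, cue, payload, position),
        })
    return prompts


if __name__ == "__main__":
    for item in generate_prompts()[:2]:
        print(item["prompt"], end="\n\n")
```

With two values per factor, the sketch yields 2 × 2 × 2 × 2 = 16 prompt variants. Keeping each factor as a separate field alongside the prompt text is a design choice that would let jailbreak rates be grouped by position, verb, cue word, or target group afterwards.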
