Considerable research effort has been devoted to ensuring that large language models (LLMs) align with human values and generate safe text. However, excessive sensitivity to certain topics can compromise a model's robustness in following instructions, and thereby degrade its overall task performance. Previous benchmarks for jailbreaking LLMs have primarily evaluated model safety without considering robustness. In this paper, we propose a benchmark that assesses both the safety and the robustness of LLMs, emphasizing the need for a balanced approach. To study text safety and output robustness comprehensively, we introduce a latent jailbreak prompt dataset in which each prompt embeds a malicious instruction. Specifically, we instruct the model to complete a regular task, such as translation, where the text to be translated itself contains a malicious instruction. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs with respect to the position of the explicit normal instruction, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, and cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). Our results demonstrate that current LLMs not only prioritize certain instruction verbs but also exhibit different jailbreak rates depending on the instruction verb used in the explicit normal instruction. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
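As an illustration of the prompt construction described above, the following minimal sketch combines an explicit normal instruction with an embedded instruction and varies the position, verb, cue word, and target group. It is not the released code: the names `LatentPrompt`, `build_prompt`, and `build_grid`, the template wording, and the bracketed placeholder strings are all assumptions made for this example, and the embedded instruction is left as an inert placeholder.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LatentPrompt:
    """One prompt plus the factors that produced it (hypothetical structure)."""
    position: str   # "prefix": normal instruction first; "suffix": normal instruction last
    verb: str       # verb of the explicit normal instruction
    cue_word: str   # word that refers to the embedded text, e.g. "sentence"
    text: str       # the full prompt sent to the model


def build_prompt(embedded, verb="Translate", cue_word="sentence",
                 position="prefix",
                 template="{verb} the {where} {cue_word} into Chinese."):
    """Wrap an embedded instruction in an explicit normal instruction.

    The template wording is an assumption for this sketch, not the dataset's.
    """
    if position not in ("prefix", "suffix"):
        raise ValueError("position must be 'prefix' or 'suffix'")
    # The normal instruction points forward or backward depending on position.
    where = "following" if position == "prefix" else "above"
    normal = template.format(verb=verb, where=where, cue_word=cue_word)
    if position == "prefix":
        text = f"{normal}\n{embedded}"
    else:
        text = f"{embedded}\n{normal}"
    return LatentPrompt(position, verb, cue_word, text)


def build_grid(embedded_template, target_groups, verbs, cue_words):
    """Enumerate every combination of position, verb, cue word, and target group."""
    prompts = []
    for position, verb, cue_word, group in product(
            ("prefix", "suffix"), verbs, cue_words, target_groups):
        # The embedded instruction stays a placeholder with the group filled in.
        embedded = embedded_template.format(target=group)
        prompts.append(build_prompt(embedded, verb, cue_word, position))
    return prompts


if __name__ == "__main__":
    grid = build_grid(
        "[EMBEDDED INSTRUCTION ABOUT {target}]",   # placeholder, not real data
        target_groups=["GROUP_A", "GROUP_B"],
        verbs=["Translate", "Render"],
        cue_words=["sentence", "text"],
    )
    for prompt in grid[:2]:
        print(prompt.text)
        print("---")
```

Keeping each factor as a field on the prompt record makes it straightforward to group model outputs by position, verb, cue word, or target group when comparing safety and robustness annotations.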