Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Researchers have invested considerable effort in aligning large language models (LLMs) with human values, using training techniques such as instruction tuning and Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF) to guard against unsafe text generation. However, these defenses remain highly vulnerable to some jailbreak attacks, which can cause a model either to become overly defensive on sensitive topics or to still generate harmful content, leaving its performance fragile. To study text safety and output robustness comprehensively, we propose a latent jailbreak prompt dataset in which each prompt embeds a malicious instruction. Specifically, we instruct the model to complete a regular task, such as translation, where the text to be translated contains a malicious instruction. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs with respect to the position of the explicit normal instruction, word replacement (verbs in explicit normal instructions, target groups in malicious instructions, and cue words in malicious instructions), and instruction replacement (different explicit normal instructions). Our results show that current LLMs not only prefer certain instruction verbs, but also exhibit different jailbreak rates for different instruction verbs in explicit normal instructions. In other words, the instruction verb in the explicit normal instruction increases, to varying degrees, the probability that the model generates unsafe content. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
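
To make the prompt construction concrete, the following is a minimal sketch of how such prompts and a per-verb jailbreak rate could be assembled. It is not taken from the released code: the template strings, the placeholder standing in for the embedded instruction, and the function names `build_prompt`, `build_dataset`, and `jailbreak_rate_by_verb` are all assumptions made for illustration, and the `unsafe` field stands in for a label that the annotation framework would supply.

```python
from itertools import product

# Hypothetical templates; the wording used in the actual dataset may differ.
NORMAL_TEMPLATE = "{verb} the {ref} sentence into Chinese."
# A neutral placeholder stands in for the embedded malicious instruction.
EMBEDDED_TEMPLATE = "[embedded instruction: {cue} about {target}]"


def build_prompt(verb, cue, target, position="prefix"):
    """Combine an explicit normal instruction with text that embeds an instruction.

    position="prefix" puts the normal instruction first;
    position="suffix" puts it after the embedded text.
    """
    embedded = EMBEDDED_TEMPLATE.format(cue=cue, target=target)
    if position == "prefix":
        normal = NORMAL_TEMPLATE.format(verb=verb, ref="following")
        return f"{normal}\n{embedded}"
    if position == "suffix":
        normal = NORMAL_TEMPLATE.format(verb=verb, ref="above")
        return f"{embedded}\n{normal}"
    raise ValueError(f"unknown position: {position!r}")


def build_dataset(verbs, cues, targets, positions=("prefix", "suffix")):
    """Enumerate every combination of position, verb, cue word, and target group."""
    return [
        {
            "position": p,
            "verb": v,
            "cue": c,
            "target": t,
            "prompt": build_prompt(v, c, t, p),
        }
        for p, v, c, t in product(positions, verbs, cues, targets)
    ]


def jailbreak_rate_by_verb(records):
    """Fraction of responses annotated as unsafe, grouped by instruction verb.

    Each record needs a 'verb' key and a boolean 'unsafe' key (assumed labels).
    """
    totals, unsafe = {}, {}
    for r in records:
        totals[r["verb"]] = totals.get(r["verb"], 0) + 1
        unsafe[r["verb"]] = unsafe.get(r["verb"], 0) + int(r["unsafe"])
    return {verb: unsafe[verb] / totals[verb] for verb in totals}
```

Varying one factor at a time (position, verb, target group, or cue word) while holding the others fixed would correspond to the analysis dimensions listed above; comparing the resulting per-verb rates is one simple way to express the verb-dependent effect the abstract reports.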
