Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Because they are trained on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite these massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, typically carefully crafted prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries toward directions that are more likely to elicit affirmative responses from the model. Our evaluations on the LLaMA-2-7b-chat model show that DROJ achieves a 100\% keyword-based Attack Success Rate (ASR), effectively preventing direct refusals. However, the model occasionally produces repetitive and non-informative responses; to mitigate this, we introduce a helpfulness system prompt that enhances the utility of the model's responses. Our code is available at https://github.com/Leon-Leyang/LLM-Safeguard.
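The core idea of optimizing a prompt embedding so that the query's representation shifts toward an "affirmative" direction can be illustrated with a toy sketch. This is not the DROJ implementation: the real method optimizes in an LLM's embedding space, while here the objective is a simple dot product with a hypothetical refusal-to-affirmative direction vector, and all names are illustrative.

```python
# Toy sketch (assumed setup, not the actual DROJ code): gradient ascent on a
# prompt embedding so it aligns more with a hypothetical "affirmative" direction.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def optimize_embedding(emb, affirmative_dir, steps=10, lr=0.05):
    """Shift `emb` to increase its projection onto `affirmative_dir`.

    The objective is dot(emb, affirmative_dir); its gradient w.r.t. `emb`
    is `affirmative_dir` itself, so each ascent step adds lr * direction.
    """
    for _ in range(steps):
        emb = [e + lr * d for e, d in zip(emb, affirmative_dir)]
    return emb

emb = [0.2, -0.5, 0.1]
direction = [1.0, 0.0, -1.0]  # hypothetical refusal -> affirmative axis
new_emb = optimize_embedding(emb, direction)

# The optimized embedding projects further along the affirmative direction.
assert dot(new_emb, direction) > dot(emb, direction)
```

In the actual setting, the objective would be the model's probability of an affirmative continuation (e.g., "Sure, here is..."), and the gradient would flow through the frozen LLM into the optimized prompt embeddings.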