Large Language Models (LLMs) have demonstrated impressive capabilities on natural language tasks, but their safety and morality remain contentious because they are trained on internet text corpora. To address these concerns, alignment techniques have been developed to improve the public usability and safety of LLMs. Yet the potential for these models to generate harmful content persists. This paper explores the concept of jailbreaking LLMs: reversing their alignment through adversarial triggers. Previous methods, such as soft embedding prompts, manually crafted prompts, and gradient-based automatic prompts, have had limited success against black-box models because they either require access to model internals or produce only a small variety of manually crafted prompts, making them easy to block. This paper introduces a novel approach that uses reinforcement learning to optimize adversarial triggers, requiring only inference API access to the target model and a small surrogate model. Our method, which leverages a BERTScore-based reward function, enhances the transferability and effectiveness of adversarial triggers on new black-box models. We demonstrate that this approach improves the performance of adversarial triggers on a previously untested language model.
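To make the reward signal concrete, the following is a minimal sketch of how a BERTScore-based reward might be computed, assuming the reward is the BERTScore F1 between the target model's response and a reference completion; the function name `trigger_reward` and its arguments are hypothetical placeholders, not the paper's API, and the sketch uses the public `bert_score` Python package.

```python
# Minimal sketch of a BERTScore-based reward for RL trigger optimization.
# Assumption: the policy is rewarded by the semantic similarity between the
# target model's response to a trigger and a reference response. The names
# below (trigger_reward, response, reference_response) are illustrative only.
from bert_score import score


def trigger_reward(response: str, reference_response: str) -> float:
    """Return the BERTScore F1 between a model response and a reference."""
    # bert_score expects lists of candidate and reference strings and
    # returns precision, recall, and F1 as tensors.
    _, _, f1 = score([response], [reference_response], lang="en", verbose=False)
    return f1.item()
```

Such a reward is dense and differentiable-free, which fits the black-box setting the abstract describes: the RL policy only needs scalar feedback on each sampled trigger, not gradients from the target model.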