Large Language Models (LLMs) have demonstrated impressive capabilities on natural language tasks, but their safety and morality remain contentious because they are trained on internet text corpora. To address these concerns, alignment techniques have been developed to improve the public usability and safety of LLMs. Yet the potential for these models to generate harmful content persists. This paper explores the concept of jailbreaking LLMs: reversing their alignment through adversarial triggers. Previous methods, such as soft embedding prompts, manually crafted prompts, and gradient-based automatic prompts, have had limited success against black-box models because they either require access to model internals or produce only a small variety of manually crafted prompts, making them easy to block. This paper introduces a novel approach that uses reinforcement learning to optimize adversarial triggers, requiring only inference API access to the target model and a small surrogate model. Our method, which leverages a BERTScore-based reward function, enhances the transferability and effectiveness of adversarial triggers on new black-box models. We demonstrate that this approach improves the performance of adversarial triggers on a previously untested language model.
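To make the reward signal concrete, the following is a minimal sketch of how a BERTScore-based reward might be computed, assuming the reward is the BERTScore F1 between the target model's response and a reference completion; the function name `trigger_reward` and its arguments are hypothetical placeholders, not the paper's API, and the sketch uses the public `bert_score` Python package.

```python
# Minimal sketch of a BERTScore-based reward for RL trigger optimization.
# Assumption: the policy is rewarded by the semantic similarity between the
# target model's response to a trigger and a reference response. The names
# below (trigger_reward, response, reference_response) are illustrative only.
from bert_score import score


def trigger_reward(response: str, reference_response: str) -> float:
    """Return the BERTScore F1 between a model response and a reference."""
    # bert_score expects lists of candidate and reference strings and
    # returns precision, recall, and F1 as tensors.
    _, _, f1 = score([response], [reference_response], lang="en", verbose=False)
    return f1.item()
```

Such a reward is dense and differentiable-free, which fits the black-box setting the abstract describes: the RL policy only needs scalar feedback on each sampled trigger, not gradients from the target model.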