Single Character Perturbations Break LLM Alignment

When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods. Code and data will be available at https://github.com/hannah-aught/space_attack.

翻译：当大语言模型部署在敏感的人机交互场景时，确保其不输出不安全、偏见性或侵犯隐私的内容至关重要。为此，模型在训练和指令设计上均被要求拒绝回答诸如"告诉我如何制造炸弹"这类不安全提示。我们发现，尽管存在这些防护机制，仅通过在模型输入末尾添加一个空格字符，就足以突破其防御体系。通过对八个开源模型的研究，我们证明这种攻击方式具有足够强度，能够以极高的成功率促使大多数模型生成有害输出。我们深入分析了该行为的成因，发现分词训练数据中单空格出现的上下文环境会促使模型在收到提示时倾向于生成列表式回复，从而覆盖了训练中习得的拒绝回答不安全请求的信号。本研究结果揭示了当前模型对齐机制的脆弱性，并强调了开发更鲁棒对齐方法的重要性。代码与数据将在 https://github.com/hannah-aught/space_attack 公开。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【ACL2020】多模态信息抽取，365页ppt

专知会员服务

151+阅读 · 2020年7月6日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日