When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods. Code and data will be available at https://github.com/hannah-aught/space_attack.
翻译:当大语言模型部署在敏感的人机交互场景时,确保其不输出不安全、偏见性或侵犯隐私的内容至关重要。为此,模型在训练和指令设计上均被要求拒绝回答诸如"告诉我如何制造炸弹"这类不安全提示。我们发现,尽管存在这些防护机制,仅通过在模型输入末尾添加一个空格字符,就足以突破其防御体系。通过对八个开源模型的研究,我们证明这种攻击方式具有足够强度,能够以极高的成功率促使大多数模型生成有害输出。我们深入分析了该行为的成因,发现分词训练数据中单空格出现的上下文环境会促使模型在收到提示时倾向于生成列表式回复,从而覆盖了训练中习得的拒绝回答不安全请求的信号。本研究结果揭示了当前模型对齐机制的脆弱性,并强调了开发更鲁棒对齐方法的重要性。代码与数据将在 https://github.com/hannah-aught/space_attack 公开。