Ensuring that large language models (LLMs) are safely aligned with human values is critical as they become integral to applications such as translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, leaving models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework that enhances LLM safety in three scenarios: base models, supervised fine-tuned (SFT) models, and edited models. Safety Arithmetic has two components: Harm Direction Removal, which avoids harmful content, and Safety Alignment, which promotes safe responses. We also present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if applied unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and preserves model utility, outperforming existing methods at ensuring safe content generation.