Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, leaving models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework that enhances LLM safety across three scenarios: base models, supervised fine-tuned (SFT) models, and edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
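The abstract does not spell out the mechanics of the Harm Direction Removal step, so the following is only a minimal sketch of one plausible reading: a task-vector-style parameter edit in which a "harm vector" (the parameter difference between a harm-fine-tuned model and its base) is partially subtracted from the target model. The function name, the `lam` scaling coefficient, and the top-fraction masking heuristic are all illustrative assumptions, not the paper's specified procedure.

```python
# Hypothetical sketch of harm-direction removal via task-vector arithmetic.
# All names and hyperparameters here are illustrative assumptions.
import torch


def harm_direction_removal(base_state, harmful_state, target_state,
                           lam=0.5, top_frac=0.1):
    """Subtract a scaled, sparsified 'harm vector' from the target model.

    base_state / harmful_state / target_state: dicts of parameter tensors
    lam: scaling coefficient for the removed direction (assumed hyperparameter)
    top_frac: fraction of largest-magnitude harm-vector entries to remove
    """
    edited = {}
    for name, theta_t in target_state.items():
        # Harm vector: how parameters shifted when the model was tuned on harmful data.
        harm_vec = harmful_state[name] - base_state[name]

        # Keep only the most salient coordinates of the harm vector (illustrative choice).
        k = max(1, int(top_frac * harm_vec.numel()))
        thresh = harm_vec.abs().flatten().topk(k).values.min()
        mask = (harm_vec.abs() >= thresh).to(harm_vec.dtype)

        # Remove the scaled harm direction from the target model's parameters.
        edited[name] = theta_t - lam * mask * harm_vec
    return edited
```

A subsequent safety-alignment step, as the abstract describes, would then steer the edited model toward safe responses; that step is not sketched here since the abstract gives no further detail.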