The safety of large language models (LLMs) has increasingly emerged as a fundamental aspect of their development. Existing safety alignment for LLMs is achieved predominantly through post-training methods, which are computationally expensive and often fail to generalize across models. The few lightweight alignment approaches either rely heavily on precomputed safety injections or depend excessively on the model's own capabilities, which limits generalization and degrades efficiency and usability during generation. In this work, we propose a safety-aware decoding method that requires only low-cost training of an expert model and employs a single neuron as a gating mechanism. By effectively balancing the model's intrinsic capabilities against external guidance, our approach preserves utility while enhancing output safety. It demonstrates clear advantages in training overhead and in generalization across model scales, offering a new perspective on lightweight alignment for the safe and practical deployment of large language models. Code: https://github.com/Beijing-AISI/NGSD.
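The single-neuron gating idea described above can be sketched in a minimal form. This is an illustrative assumption, not the paper's actual implementation: we assume the gate is one sigmoid neuron over a scalar safety signal, and that its output interpolates between the base model's logits and a safety expert's logits at each decoding step. All names (`gated_safe_logits`, `w`, `b`, `gate_input`) are hypothetical.

```python
import numpy as np

def gated_safe_logits(base_logits, expert_logits, gate_input, w, b):
    """Blend base-model and safety-expert logits with a single-neuron gate.

    Hypothetical sketch: `gate_input` stands in for a scalar safety signal
    (e.g. derived from the hidden state); the single neuron (weight `w`,
    bias `b`) maps it through a sigmoid to alpha in (0, 1), which weights
    the expert's guidance against the base model's own logits.
    """
    alpha = 1.0 / (1.0 + np.exp(-(w * gate_input + b)))  # sigmoid gate
    blended = (1.0 - alpha) * base_logits + alpha * expert_logits
    return blended, alpha
```

When the safety signal is weak, alpha stays near 0 and decoding follows the base model (preserving utility); when it is strong, alpha approaches 1 and the expert's logits dominate (enhancing safety).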