Large language models (LLMs) and multimodal LLMs are typically safety-aligned before release to prevent harmful content generation. However, recent studies show that safety behaviors are concentrated in a small subset of parameters, making alignment brittle and easy to bypass through neuron-level attacks. Moreover, most existing alignment methods operate at the behavioral level, offering limited control over the model's internal safety mechanisms. In this work, we propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. SafeNeuron first identifies safety-related neurons, then freezes them during preference optimization, preventing reliance on sparse safety pathways and forcing the model to construct redundant safety representations. Extensive experiments across models and modalities demonstrate that SafeNeuron significantly improves robustness against neuron-pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities. Furthermore, our layer-wise analysis reveals that safety behaviors are governed by stable, shared internal representations. Overall, SafeNeuron offers an interpretable and robust perspective on model alignment.
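To make the freezing step concrete, the following is a minimal PyTorch sketch of one plausible realization: given neuron indices already identified as safety-related, it masks their gradients so that preference optimization (e.g., DPO) cannot update them. The function name `freeze_safety_neurons`, the parameter-name keys, and the assumption that neurons correspond to rows of MLP weight matrices are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


def freeze_safety_neurons(model: nn.Module, safety_neurons: dict) -> None:
    """Block gradient updates to flagged neurons during preference optimization.

    safety_neurons maps a parameter name (assumed here to be an MLP weight
    matrix whose rows correspond to individual neurons) to the row indices
    identified as safety-related. With their gradients masked, the optimizer
    leaves these neurons untouched, so training must build redundant safety
    representations elsewhere in the network.
    """
    params = dict(model.named_parameters())
    for name, rows in safety_neurons.items():
        weight = params[name]
        mask = torch.ones_like(weight)
        mask[rows] = 0.0  # zero out the safety-neuron rows
        # The hook multiplies each incoming gradient by the mask on every
        # backward pass, so the frozen rows receive a zero update.
        weight.register_hook(lambda grad, m=mask: grad * m)


# Hypothetical usage: neurons 12 and 87 of one layer's MLP up-projection
# were flagged as safety-related; standard preference optimization follows.
# freeze_safety_neurons(model, {"model.layers.3.mlp.up_proj.weight": [12, 87]})
```

One caveat of gradient masking: optimizers with decoupled weight decay (e.g., AdamW) would still shrink the masked rows, so a complete implementation would also need to exclude them from weight decay or restore their values after each optimizer step.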