Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.