Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment'' that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from those supporting benign capabilities. Aligned models exhibit greater compression of harm-generation weights than their unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights underlying harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm-generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
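To make the idea of targeted weight pruning concrete, the following is a minimal sketch, not the authors' procedure. It assumes a gradient-based importance score (the magnitude of weight times gradient) and a set-difference selection rule: weights that rank highly on harmful data but not on benign data are zeroed. The function names, the scoring rule, and the `harm_frac` and `benign_frac` parameters are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def top_fraction_mask(scores, fraction):
    """Boolean mask marking the `fraction` of entries with the highest scores."""
    k = int(round(fraction * scores.size))
    if k == 0:
        return np.zeros(scores.shape, dtype=bool)
    # Indices of the k largest scores in the flattened array (order not needed).
    top_idx = np.argpartition(scores.ravel(), -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(scores.shape)


def prune_harm_specific(weights, harm_grad, benign_grad,
                        harm_frac=0.05, benign_frac=0.05):
    """Zero weights that matter for harmful data but not for benign data.

    ASSUMPTION: importance is |weight * gradient|, with gradients computed
    elsewhere on a harmful and a benign dataset. This scoring rule is an
    illustrative stand-in, not taken from the source.
    """
    harm_scores = np.abs(weights * harm_grad)
    benign_scores = np.abs(weights * benign_grad)

    harm_mask = top_fraction_mask(harm_scores, harm_frac)
    benign_mask = top_fraction_mask(benign_scores, benign_frac)

    # Set difference: important for harm, not among the most important for benign use.
    prune_mask = harm_mask & ~benign_mask
    pruned = np.where(prune_mask, 0.0, weights)
    return pruned, prune_mask
```

Under these assumptions, the fraction of weights that must be removed before harmful generation degrades gives a rough handle on how compact the harm-related weights are, and comparing that fraction between an aligned and an unaligned model is one way to operationalize the compression described above.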