Mixture-of-Experts (MoE) language models pose unique challenges for safety alignment because their sparse routing mechanisms can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing shifts or expert-dominance effects rather than by directly repairing safety-critical experts. To address this, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs safety-critical experts while preventing routing-based bypasses. RASA identifies the experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits more from targeted expert repair than from global parameter updates, offering a practical, architecture-preserving alternative to prior approaches.
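To make the first two steps of this procedure concrete, the following is a minimal PyTorch sketch under stated assumptions: the toy `MoELayer`, the `activation_rate` helper, the placeholder hidden states, and the top-2 activation-gap selection are all illustrative choices, not RASA's actual implementation or API. The third step, routing-consistency enforcement, is omitted here; it would add a training objective penalizing divergence of router distributions from those observed in safety-aligned contexts.

```python
# Hypothetical sketch of RASA steps 1-2: flag jailbreak-activated experts,
# then fine-tune only those experts with the router frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k MoE layer: a linear router dispatching tokens to MLP experts."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out, idx

def activation_rate(layer, hidden):
    """Fraction of routing slots assigned to each expert for a batch."""
    with torch.no_grad():
        _, idx = layer(hidden)
    counts = torch.bincount(idx.flatten(), minlength=len(layer.experts))
    return counts.float() / idx.numel()

layer = MoELayer()
# Placeholder hidden states standing in for jailbreak vs. benign prompt activations.
h_jailbreak = torch.randn(512, 64)
h_benign = torch.randn(512, 64)

# Step 1: flag experts disproportionately activated by successful jailbreaks.
gap = activation_rate(layer, h_jailbreak) - activation_rate(layer, h_benign)
critical = gap.topk(2).indices.tolist()  # top-2 gap is an illustrative threshold

# Step 2: freeze everything (router included, so any attack-success-rate drop
# cannot come from routing shifts), then unfreeze only the flagged experts.
for p in layer.parameters():
    p.requires_grad_(False)
for e in critical:
    for p in layer.experts[e].parameters():
        p.requires_grad_(True)

trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"repairing experts {critical}: {sum(p.numel() for p in trainable)} trainable params")
```

Freezing the router before repair is the design point this sketch illustrates: with routing fixed, any improvement in robustness must come from updated expert parameters rather than from the degenerate routing-based bypasses described above.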