Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, sparse activation also introduces new safety challenges. Because only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, which makes it hard to control and lets it vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, with gains of up to 89.2%. For adult-content generation, MASCing enables models to comply with requests they would otherwise refuse, increasing the average generation success rate from 52.6% to 82.0%, with gains of up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
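To make the inference-time step concrete, the following is a minimal sketch of applying a steering mask to one MoE layer's router before top-k expert selection. It is an illustration under stated assumptions, not the MASCing implementation: the abstract does not say whether the mask is additive, how it is derived from the steering matrix, or how gate weights are renormalized, so the additive form, the function name `steered_top_k`, and the softmax over the selected experts are all hypothetical choices. The LSTM surrogate and the steering-matrix optimization are omitted entirely.

```python
import numpy as np

def steered_top_k(router_logits, steering_mask, k):
    """Select top-k experts per token after adding a steering mask.

    router_logits: (num_tokens, num_experts) raw gate scores for one layer.
    steering_mask: (num_experts,) additive offsets (an assumption): a large
                   negative value suppresses an expert, a positive value
                   promotes it, and 0 leaves routing unchanged.
    Returns (expert_ids, gate_weights), each of shape (num_tokens, k).
    """
    steered = router_logits + steering_mask  # broadcast over tokens
    # Indices of the k largest steered logits, highest first.
    expert_ids = np.argsort(-steered, axis=-1)[:, :k]
    chosen = np.take_along_axis(steered, expert_ids, axis=-1)
    # Numerically stable softmax over the selected experts only.
    chosen = chosen - chosen.max(axis=-1, keepdims=True)
    weights = np.exp(chosen)
    weights /= weights.sum(axis=-1, keepdims=True)
    return expert_ids, weights

# Toy router output: 2 tokens, 4 experts (made-up numbers).
logits = np.array([[2.0, 1.0, 0.5, -1.0],
                   [0.1, 3.0, 2.5, 0.0]])
no_steering = np.zeros(4)
suppress_expert_1 = np.array([0.0, -1e9, 0.0, 0.0])

baseline_ids, _ = steered_top_k(logits, no_steering, k=2)
steered_ids, steered_w = steered_top_k(logits, suppress_expert_1, k=2)
```

In this toy setup, suppressing expert 1 forces both tokens onto other experts while the rest of the routing logic is untouched, which is the sense in which a mask can override expert selection without changing model weights.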