We study controlled post-training refusal suppression in routed mixture-of-experts (MoE) and hybrid-MoE foundation models. The goal is to increase non-refusal target-response behavior while preserving general capability under a compact intervention footprint. Existing broad direction-based edits can perturb general-purpose computation, whereas support-only expert edits often lack the capacity to correct heterogeneous refusal representations. To address these limitations, we introduce Localized Multidirectional Correction (LoMC), a support-gated intervention framework that follows a support-then-correction execution order: it first identifies a compact edit support, then aggregates prototype correction directions into layer-wise correction directions, and finally applies a rank-one layer-wise correction only within the selected support. Because the edit support acts as a structural gating constraint, LoMC increases correction capacity without expanding the intervention scope. Experiments on text-only and multimodal safety benchmarks across four routed backbones show that LoMC substantially improves non-refusal target-response behavior while maintaining general capability under a compact intervention footprint.
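To make the support-then-correction order concrete, the following is a minimal sketch of a support-gated rank-one edit. It is not the authors' implementation. The abstract does not specify how the edit support is selected, how prototype directions are aggregated, or the exact form of the correction, so the sketch assumes all three: the support is given as a set of expert indices, aggregation is a normalized mean of unit prototype directions, and the correction removes the component of each expert's output along the layer direction (W' = W - s d dᵀW). All function and parameter names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def aggregate_directions(prototypes):
    """Assumed aggregation rule: mean of unit prototype directions, renormalized."""
    units = [normalize(p) for p in prototypes]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return normalize(mean)


def rank_one_correct(weight, direction, strength=1.0):
    """Return W - strength * d d^T W for W stored as rows (d_out x d_in).

    `direction` is a unit vector in the output space, so the update
    d (d^T W) has rank at most one.
    """
    rows, cols = len(weight), len(weight[0])
    # d^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[r] * weight[r][c] for r in range(rows)) for c in range(cols)]
    return [
        [weight[r][c] - strength * direction[r] * proj[c] for c in range(cols)]
        for r in range(rows)
    ]


def apply_gated_correction(expert_weights, support, direction, strength=1.0):
    """Edit only experts inside the support; all other experts are returned unchanged."""
    return {
        idx: rank_one_correct(w, direction, strength) if idx in support else w
        for idx, w in expert_weights.items()
    }


# Toy usage: two experts with identity weights, only expert 0 is in the support.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
layer_direction = aggregate_directions(prototypes)
experts = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[1.0, 0.0], [0.0, 1.0]]}
edited = apply_gated_correction(experts, support={0}, direction=layer_direction)
```

In this toy setup the gating is what keeps the footprint compact: expert 1 is passed through untouched, while expert 0 can no longer write along the aggregated direction. Adding more prototypes changes the direction being corrected but not the set of experts being edited.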