Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose \textbf{TraceRouter}, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
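As a rough illustration of stage (3), the zero-out intervention behind a feature influence score can be sketched on a toy model. This is a minimal sketch under stated assumptions, not the TraceRouter implementation: the abstract does not give the exact FIS formula, so the score here is simply the absolute change in each downstream unit when one SAE feature is set to zero. All names (`sae_encode`, `downstream`, `feature_influence_scores`), the random weights, and the single ReLU layer standing in for the downstream network are hypothetical.

```python
import random

def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_encode(h, W_enc, b_enc):
    # Hypothetical SAE encoder: sparse, non-negative feature activations.
    return relu([z + b for z, b in zip(matvec(W_enc, h), b_enc)])

def downstream(f, W_dec, W_next):
    # Decode features back to the residual space, then apply a toy
    # "next layer" (assumption: one ReLU layer stands in for the rest).
    return relu(matvec(W_next, matvec(W_dec, f)))

def feature_influence_scores(f, W_dec, W_next):
    # scores[i][j]: absolute change in downstream unit j when feature i
    # is zeroed out (an assumed, simplified stand-in for FIS).
    base = downstream(f, W_dec, W_next)
    scores = []
    for i in range(len(f)):
        f_zeroed = list(f)
        f_zeroed[i] = 0.0
        out = downstream(f_zeroed, W_dec, W_next)
        scores.append([abs(b - o) for b, o in zip(base, out)])
    return scores

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_feat, d_next = 8, 16, 6          # toy sizes, chosen arbitrarily
W_enc = rand_matrix(n_feat, d_model, rng)
b_enc = [0.0] * n_feat
W_dec = rand_matrix(d_model, n_feat, rng)
W_next = rand_matrix(d_next, d_model, rng)
h = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # onset-layer activation

f = sae_encode(h, W_enc, b_enc)
scores = feature_influence_scores(f, W_dec, W_next)

# Rank features by total downstream influence; the highest-ranked ones
# would be the candidates for path-level suppression.
ranked = sorted(range(n_feat), key=lambda i: sum(scores[i]), reverse=True)
```

In this sketch an inactive feature always receives a score of zero, since zeroing it changes nothing downstream. A real pipeline would restrict the ranking to features already flagged as malicious by the differential activation analysis in stage (2).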