By introducing routers that selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces the computational cost of large language models (LLMs) while maintaining competitive performance, especially for models with very large parameter counts. However, prior work has largely focused on utility and efficiency, leaving the safety risks of this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulating only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases the attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose F-SOUR, a Fine-grained token-layer-wise Stochastic Optimization framework for discovering more concrete Unsafe Routes, which explicitly accounts for the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves average ASRs of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions for safeguarding MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is available at https://github.com/TrustAIRLab/UnsafeMoE.
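To make the notion of a route concrete, the following toy sketch shows how a top-k router picks experts for a single token and how masking an expert changes that choice. It is an illustration under stated assumptions, not the method from the abstract: the function `top_k_route`, its `masked_experts` parameter, and the example logits are all hypothetical, masking is modeled simply as setting an expert's logit to negative infinity before selection, and the softmax is taken over the selected logits only (real MoE models differ in how they normalize gate weights).

```python
import math

def top_k_route(logits, k, masked_experts=()):
    """Return (expert_index, gate_weight) pairs for one token.

    Hypothetical helper: `masked_experts` models masking as forcing the
    listed experts' logits to -inf so they can never be selected.
    """
    scores = [(-math.inf if i in masked_experts else s)
              for i, s in enumerate(logits)]
    # Rank experts by score (ties broken by lower index) and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    chosen = [i for i in ranked if scores[i] != -math.inf]
    if not chosen:
        raise ValueError("all experts are masked")
    # Softmax over the selected logits only (an assumption of this sketch).
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Toy router logits for one token over four experts (made-up values).
logits = [2.0, 0.5, 1.5, -1.0]
default_route = top_k_route(logits, k=2)
masked_route = top_k_route(logits, k=2, masked_experts={0})

print("default experts:", [i for i, _ in default_route])
print("masked experts: ", [i for i, _ in masked_route])
```

In this toy setting, masking expert 0 diverts the token from experts 0 and 2 to experts 2 and 1. A route in the abstract's sense is the collection of such per-layer, per-token choices, so a change at a single router changes the route the token takes through the model.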