Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFNs), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-associated experts. We detect key experts by comparing how often they activate on paired inputs that exhibit opposite behaviors (e.g., safe vs. unsafe). By selectively activating or deactivating such experts during inference, we control behaviors like faithfulness and safety without fine-tuning. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. Conversely, unsafe steering reduces safety by 41% on its own, and by 100% when combined with existing jailbreak methods, bypassing all safety guardrails. Overall, SteerMoE offers a lightweight, effective, and widely applicable test-time control method, while revealing unique vulnerabilities in MoE LLMs. https://github.com/adobe-research/SteerMoE
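The expert-detection step described above can be illustrated with a minimal sketch. Assuming we have already logged the router's top-k expert selections per token for a pair of input sets (one per behavior), we can rank experts by the gap in their activation frequencies between the two sets. All names, the toy routing traces, and the `top_n` cutoff here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def expert_activation_freq(routing, num_experts):
    """Fraction of tokens on which each expert was selected.

    routing: int array of shape (num_tokens, k), the top-k expert
    indices the router chose for each token.
    """
    counts = np.bincount(routing.ravel(), minlength=num_experts)
    return counts / routing.shape[0]

def behavior_associated_experts(routing_pos, routing_neg, num_experts, top_n=4):
    """Rank experts by how differently they activate across the two behaviors."""
    diff = (expert_activation_freq(routing_pos, num_experts)
            - expert_activation_freq(routing_neg, num_experts))
    order = np.argsort(-np.abs(diff))  # largest frequency gap first
    return order[:top_n], diff[order[:top_n]]

# Toy routing traces (100 tokens, top-2 routing over 8 experts), hypothetical:
rng = np.random.default_rng(0)
safe = rng.integers(0, 8, size=(100, 2))
unsafe = safe.copy()
unsafe[unsafe == 3] = 7  # under "unsafe" prompts, expert 7 fires instead of expert 3

experts, gaps = behavior_associated_experts(safe, unsafe, num_experts=8)
# experts[:2] recovers {3, 7}, the two experts whose usage shifted
```

At inference time, steering would then amount to masking the router logits of the detected experts (to deactivate them) or forcing their selection (to activate them); that control step depends on the specific MoE implementation and is not shown here.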