RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing

Recent multi-model AI systems use LLM routers to reduce computational cost while maintaining response quality by assigning each query to the most appropriate model. However, as classifiers, LLM routers are vulnerable to a novel class of adversarial attacks, LLM rerouting, in which adversaries prepend specially crafted triggers to user queries to manipulate routing decisions. Such attacks can increase computational cost, degrade response quality, and even bypass safety guardrails, yet their security implications remain largely underexplored. In this work, we bridge this gap by systematizing LLM rerouting threats according to the adversary's objectives (i.e., cost escalation, quality hijacking, and safety bypass) and knowledge. Based on this threat taxonomy, we conduct a measurement study of real-world LLM routing systems against existing LLM rerouting attacks. The results reveal that existing routing systems are vulnerable to rerouting attacks, especially in the cost escalation scenario. We then characterize existing rerouting attacks using interpretability techniques, revealing that they exploit router decision boundaries through confounder gadgets that are prepended to queries to force misrouting. To mitigate these risks, we introduce RerouteGuard, a flexible and scalable guardrail framework against LLM rerouting. RerouteGuard filters adversarial rerouting prompts via dynamic embedding-based detection and adaptive thresholding. Extensive evaluations across three attack settings and four benchmarks demonstrate that RerouteGuard achieves over 99% detection accuracy against state-of-the-art rerouting attacks while having negligible impact on legitimate queries. These results indicate that RerouteGuard offers a principled and practical solution for safeguarding multi-model AI systems against adversarial rerouting.
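To make the phrase "embedding-based detection and adaptive thresholding" concrete, the following is a minimal, purely illustrative sketch and not RerouteGuard's actual design, which the abstract does not specify. It assumes a toy hashed-trigram embedding, scores a query by its cosine distance to the centroid of benign calibration queries, and sets the threshold as a quantile of the benign scores for a chosen false-positive rate. The names `embed`, `ToyGuard`, and `target_fpr` are hypothetical.

```python
import math
import zlib

DIM = 64  # size of the toy embedding (an assumption for this sketch)


def embed(text):
    """Toy embedding: hashed character-trigram counts, L2-normalised.

    A real system would use a learned sentence encoder instead.
    """
    vec = [0.0] * DIM
    s = text.lower()
    for i in range(len(s) - 2):
        bucket = zlib.crc32(s[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine_distance(a, b):
    # Inputs are unit-length (or all-zero), so the dot product is the cosine.
    return 1.0 - sum(x * y for x, y in zip(a, b))


class ToyGuard:
    """Hypothetical detector: distance to the benign centroid plus a quantile threshold."""

    def __init__(self, benign_queries, target_fpr=0.05):
        embs = [embed(q) for q in benign_queries]
        centroid = [sum(col) / len(embs) for col in zip(*embs)]
        norm = math.sqrt(sum(v * v for v in centroid)) or 1.0
        self.centroid = [v / norm for v in centroid]
        # Threshold adapts to the calibration data: it is the (1 - target_fpr)
        # empirical quantile of the benign scores.
        scores = sorted(cosine_distance(e, self.centroid) for e in embs)
        idx = max(0, math.ceil((1.0 - target_fpr) * len(scores)) - 1)
        self.threshold = scores[idx]

    def score(self, query):
        return cosine_distance(embed(query), self.centroid)

    def is_suspicious(self, query):
        # Flag queries that sit farther from benign traffic than the threshold allows.
        return self.score(query) > self.threshold
```

In such a setup, a query flagged by `is_suspicious` would be filtered or sent to a default model before it reaches the router. Recomputing the threshold on fresh benign traffic is one simple way a threshold could be made adaptive; the paper's own mechanism may differ.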
