HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

Production LLM deployments increasingly maintain heterogeneous model pools whose costs span an order of magnitude. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, so they must be retrained whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements for each query and matches them against configuration-defined model profiles. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: the peak-quality regime exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; the iso-quality regime matches Sonnet at 54.1% cost savings, roughly 6x the 9.1% savings of our prior in-house binary router; the aggressive regime pushes savings to 72.5% at a cost of 3.2 points of quality. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.
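The routing step described above can be illustrated with a minimal sketch. The abstract does not give the exact shortfall formula, so this example assumes shortfall is the summed amount by which predicted requirements exceed a model's profile, and that when no model is within the threshold the router falls back to the model with the smallest shortfall. The names `ModelProfile`, `shortfall`, `route`, and `POOL`, as well as all capability and cost values, are hypothetical and not taken from HyDRA.

```python
from dataclasses import dataclass
from typing import Sequence

# The four capability dimensions named in the abstract.
DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    """Configuration-defined profile (hypothetical values, not from the paper)."""
    name: str
    cost: float                   # relative cost per request, arbitrary units
    capabilities: Sequence[float]  # one score in [0, 1] per entry of DIMENSIONS


def shortfall(requirements: Sequence[float], capabilities: Sequence[float]) -> float:
    """Assumed definition: total amount by which requirements exceed capabilities."""
    return sum(max(0.0, r - c) for r, c in zip(requirements, capabilities))


def route(requirements: Sequence[float],
          pool: Sequence[ModelProfile],
          threshold: float) -> ModelProfile:
    """Pick the cheapest model whose shortfall is within the threshold."""
    eligible = [m for m in pool
                if shortfall(requirements, m.capabilities) <= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    # Assumed fallback: no model qualifies, so take the closest match.
    return min(pool, key=lambda m: shortfall(requirements, m.capabilities))


# Hypothetical three-model pool; editing this list is the only change
# needed to add or remove a model, since the predictor never sees it.
POOL = [
    ModelProfile("small", 1.0, (0.4, 0.5, 0.4, 0.3)),
    ModelProfile("medium", 4.0, (0.7, 0.8, 0.7, 0.6)),
    ModelProfile("large", 10.0, (0.9, 0.9, 0.9, 0.9)),
]

if __name__ == "__main__":
    query_scores = (0.6, 0.7, 0.6, 0.5)  # stand-in for the K=4 sigmoid outputs
    for threshold in (0.0, 1.0):
        print(threshold, route(query_scores, POOL, threshold).name)
```

In this sketch, raising the threshold tolerates a larger capability gap and so shifts traffic toward cheaper models, which is one plausible reading of how a single tunable threshold could move between the peak-quality, iso-quality, and aggressive regimes.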
