Large language models (LLMs) have achieved success, but cost and privacy constraints often necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we leverage internal hidden states, which capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across both in-domain and out-of-distribution scenarios. Our results show that ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, respectively, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
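To make the cross-layer aggregation concrete, below is a minimal sketch, not the paper's actual ProbeDirichlet implementation. It assumes per-layer hidden states of the final prompt token have already been extracted from the local model, learns Dirichlet concentration parameters over layers, samples layer weights during training (falling back to the Dirichlet mean at inference), and predicts a routing logit. The class name DirichletProbe, the layer and hidden dimensions, the MLP head, and the 0.5 routing threshold are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): aggregate per-layer hidden
# states with weights drawn from a learnable Dirichlet distribution, then
# predict whether the local model will answer correctly. Shapes, the MLP head,
# and the binary routing target are assumptions for illustration.
import torch
import torch.nn as nn


class DirichletProbe(nn.Module):
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # Unconstrained parameters; softplus keeps concentrations positive.
        self.log_alpha = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, num_layers, hidden_dim), e.g. the hidden state
        # of the final prompt token at every transformer layer.
        alpha = torch.nn.functional.softplus(self.log_alpha) + 1e-4
        dist = torch.distributions.Dirichlet(alpha)
        # Probabilistic training: sample layer weights each step;
        # at inference, use the Dirichlet mean instead.
        w = dist.rsample() if self.training else alpha / alpha.sum()
        pooled = torch.einsum("l,bld->bd", w, layer_states)
        return self.classifier(pooled).squeeze(-1)  # routing logit


# Usage: offload to the cloud model when predicted success probability is low.
probe = DirichletProbe(num_layers=32, hidden_dim=4096)
states = torch.randn(8, 32, 4096)             # stand-in for extracted hidden states
route_to_cloud = torch.sigmoid(probe(states)) < 0.5  # hypothetical threshold
```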