Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token counts, ignoring the within-domain variance that actually determines whether a model succeeds. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings translate directly into a lower cloud bill. We present Brick, a multimodal router that scores each model on six capability dimensions, combines these scores with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost by 22.15x with a loss of 11.85 accuracy points. Median latency drops from 51.2s to 22.8s.
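The abstract does not spell out the cost-penalized geometric rule, so the following is only a minimal sketch of one plausible reading, not Brick's actual method. It assumes each model has a six-dimensional capability vector in [0, 1], that the per-query difficulty estimate is a demand vector in the same space, and that the preference knob is a scalar in [0, 1] that weights a normalized log-cost term. The names `ModelProfile`, `shortfall`, and `route`, and all numeric values, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelProfile:
    name: str
    capability: Sequence[float]  # assumed: six scores in [0, 1]
    cost: float                  # assumed: relative cost per request, > 0


def shortfall(capability: Sequence[float], demand: Sequence[float]) -> float:
    """Euclidean norm of the gap on dimensions where demand exceeds capability."""
    if len(capability) != len(demand):
        raise ValueError("capability and demand must have the same length")
    return math.sqrt(sum(max(0.0, d - c) ** 2 for c, d in zip(capability, demand)))


def route(models: Sequence[ModelProfile], demand: Sequence[float],
          preference: float) -> ModelProfile:
    """Pick the model minimizing shortfall + preference * normalized log-cost.

    preference = 0.0 ignores cost (quality only); 1.0 penalizes cost the most.
    Ties are broken in favor of the cheaper model.
    """
    if not 0.0 <= preference <= 1.0:
        raise ValueError("preference must lie in [0, 1]")
    log_costs = [math.log(m.cost) for m in models]
    lo, hi = min(log_costs), max(log_costs)
    span = (hi - lo) or 1.0  # avoid division by zero when all costs are equal

    def penalty(pair):
        model, log_cost = pair
        geometric = shortfall(model.capability, demand)
        return (geometric + preference * (log_cost - lo) / span, model.cost)

    return min(zip(models, log_costs), key=penalty)[0]


# Made-up profiles for illustration; these are not values from the paper.
LOCAL = ModelProfile("local-open-weight", (0.5,) * 6, cost=1.0)
FRONTIER = ModelProfile("frontier", (0.9,) * 6, cost=50.0)
EASY_QUERY = (0.3,) * 6
HARD_QUERY = (0.8,) * 6
```

With these made-up profiles, `route([LOCAL, FRONTIER], EASY_QUERY, 0.5)` picks the local model because neither model falls short and the local one is cheaper. The hard query goes to the frontier model at preference 0.0 or 0.5, and falls back to the local model only at 1.0, where the cost penalty outweighs the capability gap. This is the kind of continuous quality-versus-saving trade-off the preference knob is described as exposing.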