Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
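The consensus-voting idea behind correctness estimation can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the function name, input format, and agreement-rate formulation are our assumptions. The majority answer across the model pool serves as a pseudo-label for each query, and each model's estimated correctness is its agreement rate with those pseudo-labels, requiring no ground-truth answers.

```python
from collections import Counter

def consensus_correctness(answers_by_model):
    """Estimate per-model correctness without ground-truth labels.

    answers_by_model: dict mapping model name -> list of answers,
    one per query (all lists the same length). The majority answer
    per query acts as a pseudo-label; a model's estimated
    correctness is its agreement rate with the pseudo-labels.
    (Illustrative sketch; not the paper's exact method.)
    """
    models = list(answers_by_model)
    n_queries = len(next(iter(answers_by_model.values())))
    agree = {m: 0 for m in models}
    for q in range(n_queries):
        # Tally this query's answers across the pool and take the mode.
        votes = Counter(answers_by_model[m][q] for m in models)
        pseudo_label, _ = votes.most_common(1)[0]
        for m in models:
            if answers_by_model[m][q] == pseudo_label:
                agree[m] += 1
    return {m: agree[m] / n_queries for m in models}
```

Per-query agreement vectors like these could then feed a clustering step to surface model-specific skill niches, in the spirit of the hierarchical-clustering component described above.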