Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models as the catalog grows from 10 to 110 agents. Routing F1 on under-specified requests drops 16--23 percentage points (pp) across models. An oracle analysis decomposes the degradation into a \emph{retrieval} gap (the model cannot surface the right tool) and a \emph{confusion} gap (even with perfect retrieval, the oracle ceiling drops 10\,pp). Embedding-based shortlisting recovers +10--11\,pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic, at +10--17\,pp, despite 10--15\,pp lower absolute performance.