AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that the Part III synthesized compositional interactions are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, establishing a reproducible foundation to study and accelerate the emerging agent ecosystem.

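To make the task framing concrete, the following is a minimal sketch of content-aware capability matching over positive-only interaction data, as the abstract describes it. It is not the benchmark's method or data format: the `Agent` fields, the bag-of-words cosine scorer, the `recommend` and `hit_at_k` helpers, and the three toy agents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record: a backbone model coupled with a toolkit."""
    agent_id: str
    backbone: Optional[str]   # None for a toolkit-only agent
    tools: tuple              # empty for an LLM-only agent
    profile: str              # free-text capability profile


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query, agents, k=1):
    """Rank agents by similarity between the query and each capability profile."""
    q = Counter(tokenize(query))
    ranked = sorted(
        agents,
        key=lambda ag: cosine(q, Counter(tokenize(ag.profile))),
        reverse=True,
    )
    return [ag.agent_id for ag in ranked[:k]]


def hit_at_k(interactions, agents, k=1):
    """Fraction of positive (query, agent_id) pairs recovered in the top k."""
    hits = sum(1 for query, pos in interactions if pos in recommend(query, agents, k))
    return hits / len(interactions)


# Toy catalog covering the three agent types named in the abstract.
agents = [
    Agent("llm-only", "model-a", (), "general reasoning and text summarization"),
    Agent("tool-only", None, ("web_search",), "web search for current news"),
    Agent("combo", "model-a", ("calculator", "web_search"),
          "math calculation with calculator and web search lookup"),
]

# Positive-only interactions: each query is paired with one agent known to fit.
interactions = [
    ("summarize this long text", "llm-only"),
    ("find current news on the web", "tool-only"),
]

print(recommend("find current news on the web", agents, k=2))
print(hit_at_k(interactions, agents, k=1))
```

Because the scorer relies only on profile text, it can rank an agent that has never appeared in an interaction record, which is the property that matters in the long-tail regime where popularity signals are sparse.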