The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
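To make the gap between semantic similarity and execution-grounded relevance concrete, the following is a minimal sketch of how a description-based ranking could be scored against execution-derived relevance. It is not taken from the AgentSearchBench code: the agent names, the similarity and success-rate values, and the choice of NDCG as the ranking metric are all illustrative assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of relevance gains in rank order."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k of a ranking, where `relevance` maps agent id -> graded relevance."""
    gains = [relevance.get(agent_id, 0.0) for agent_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy data (not from the benchmark):
# similarity between the task query and each agent's textual description.
similarity = {"agent_a": 0.91, "agent_b": 0.78, "agent_c": 0.40}
# Assumed execution-grounded relevance, e.g. a measured task success rate.
exec_success = {"agent_a": 0.2, "agent_b": 0.5, "agent_c": 0.9}

# Rank once by description similarity and once by execution-grounded relevance.
by_similarity = sorted(similarity, key=similarity.get, reverse=True)
by_execution = sorted(exec_success, key=exec_success.get, reverse=True)

ndcg_sim = ndcg_at_k(by_similarity, exec_success, k=3)
ndcg_exec = ndcg_at_k(by_execution, exec_success, k=3)

print(f"NDCG@3, description-based ranking: {ndcg_sim:.3f}")
print(f"NDCG@3, execution-based ranking:   {ndcg_exec:.3f}")
```

In this toy setting the agent whose description best matches the query performs worst when executed, so the description-based ranking scores well below the execution-based one. A reranker that incorporates behavioral signals would aim to close that difference.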