Since the introduction of the Model Context Protocol (MCP), the number of tools available to Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents still rely predominantly on web browsers to interact with their environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which together include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we first use the ground-truth tools to show that, assuming perfect tool retrieval, tool-calling agents have the potential to both improve performance and reduce costs. Next, we evaluate agents with tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. In contrast, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments but struggle seriously when navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems remains challenging for current models and requires both better reasoning and better retrieval models.