TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs, assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.

