The Model Context Protocol (MCP) is rapidly becoming the standard interface through which Large Language Models (LLMs) discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers, 220 tools, and 1,000 tasks set in realistic, multi-step workflows. Task prompts are written in natural language that avoids naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. Tasks are scored with a claims-based rubric that awards partial credit for the factual claims satisfied by the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation of frontier models shows that top models achieve pass rates exceeding 50%, with failures arising primarily from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.
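To make the claims-based rubric concrete, the sketch below shows one way partial credit could be computed: each task lists independent factual claims, a judge marks each claim as satisfied or not in the model's final answer, and the score is the satisfied fraction. All names, the dataclass layout, and the pass threshold here are illustrative assumptions, not the benchmark's actual API or pass criterion.

```python
# Hypothetical sketch of claims-based partial-credit scoring (not MCP-Atlas's real API).
from dataclasses import dataclass

@dataclass
class ClaimResult:
    claim: str        # a single factual claim the final answer should contain
    satisfied: bool   # whether the model's final answer supports this claim

def partial_credit(results: list[ClaimResult]) -> float:
    """Fraction of rubric claims satisfied by the model's final answer."""
    if not results:
        return 0.0
    return sum(r.satisfied for r in results) / len(results)

def passed(results: list[ClaimResult], threshold: float = 1.0) -> bool:
    """Count a task as passed if the claim score meets the threshold
    (the actual pass criterion in MCP-Atlas may differ)."""
    return partial_credit(results) >= threshold

# Example: 2 of 3 claims satisfied -> score ~0.67, not a pass at threshold 1.0.
example = [
    ClaimResult("Reports the repository's latest release tag", True),
    ClaimResult("States the number of open issues", True),
    ClaimResult("Lists the correct issues labeled 'bug'", False),
]
print(partial_credit(example), passed(example))
```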