The Model Context Protocol (MCP) introduces a standard specification that defines how Foundation Model (FM)-based agents should interact with external systems by invoking tools. To understand a tool's purpose and features, however, FMs rely on natural-language tool descriptions, which makes these descriptions critical in guiding FMs to select the optimal tool for a given (sub)task and to pass it the right arguments. Although defects, or smells, in these descriptions can misguide FM-based agents, their prevalence and consequences in the MCP ecosystem remain unclear. We therefore empirically examine 856 tools spread across 103 MCP servers, assessing the quality of their descriptions and the impact of that quality on agent performance. We identify six components of tool descriptions from the literature, develop a scoring rubric based on these components, and formalize tool description smells in terms of this rubric. Operationalizing the rubric through an FM-based scanner, we find that 97.1% of the analyzed tool descriptions contain at least one smell, with 56% failing to state their purpose clearly. Augmenting these descriptions for all components improves task success rates by a median of 5.85 percentage points and partial goal completion by 15.12%, but it also increases the number of execution steps by 67.46% and regresses performance in 16.67% of cases. These results indicate that achieving performance gains is not straightforward: improvements must be traded off against execution cost, and the execution context can also affect the outcome. Furthermore, component ablations show that compact variants built from different component combinations often preserve behavioral reliability while reducing unnecessary token overhead, enabling more efficient use of the FM context window and lower execution costs.