The Model Context Protocol (MCP) has rapidly become a de facto standard for connecting LLM-based agents to external tools via reusable MCP servers. In practice, however, server selection and onboarding rely heavily on free-text tool descriptions that are intentionally loosely constrained. Although this flexibility largely ensures the scalability of the MCP server ecosystem, it also creates a reliability gap: descriptions often misrepresent or omit key semantics, increasing trial-and-error integration, degrading agent behavior, and potentially introducing security risks. To close this gap, we present the first systematic study of description smells in MCP tool descriptions and their impact on usability. Specifically, we synthesize software/API documentation practices and agentic tool-use requirements into a four-dimensional quality standard (accuracy, functionality, information completeness, and conciseness) covering 18 specific smell categories. Using this standard, we conducted a large-scale empirical study on a carefully constructed dataset of 10,831 MCP servers. We find that description smells are pervasive (e.g., 73% of tool names are repeated, and thousands of descriptions contain incorrect parameter semantics or omit return-value descriptions), reflecting a "code-first, description-last" pattern. Through a controlled mutation-based study, we show that these smells significantly affect LLM tool selection, with the functionality and accuracy dimensions having the largest effects (+11.6% and +8.8%, respectively; p < 0.001). In competitive settings with functionally equivalent servers, standard-compliant descriptions reach a 72% selection probability (a 260% improvement over the 20% uniform baseline), demonstrating that smell-guided remediation yields substantial practical benefits. We release our labeled dataset and quality standard to support future work on reliable and secure MCP ecosystems.