Medical calculators are fundamental to quantitative, evidence-based clinical practice. However, their real-world use is an adaptive, multi-stage process requiring proactive EHR data acquisition, scenario-dependent calculator selection, and multi-step computation, whereas current benchmarks focus only on static single-step calculations with explicit instructions. To address these limitations, we introduce MedMCP-Calc, the first benchmark for evaluating LLMs in realistic medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP-Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions that mimic natural queries, structured EHR database interaction, external reference retrieval, and process-level evaluation. Our evaluation of 23 leading models reveals critical limitations: even top performers like Claude Opus 4.5 exhibit substantial gaps, including difficulty selecting appropriate calculators for end-to-end workflows given fuzzy queries, poor performance in iterative SQL-based database interactions, and marked reluctance to leverage external tools for numerical computation. Performance also varies considerably across clinical domains. Building on these findings, we develop CalcMate, a fine-tuned model incorporating scenario planning and tool augmentation that achieves state-of-the-art performance among open-source models. The benchmark and code are available at https://github.com/SPIRAL-MED/MedMCP-Calc.