To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition: they must retrieve the right tools, acquire evidence, and integrate it. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/