MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, with each task consisting of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, comprising edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
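
The strict sequential evaluation protocol can be illustrated with a minimal sketch. This is not the released MedSPOT code: the names `Step`, `is_hit`, and `evaluate_task` are hypothetical, and the hit criterion (a predicted click point falling inside the annotated bounding box) is an assumption made for illustration. The only behavior taken from the abstract is that scoring of a task stops at the first incorrect grounding prediction.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]              # (x, y) predicted click location
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Step:
    """One grounding step of a task (hypothetical structure)."""
    target: Box                 # annotated target region for this keyframe
    prediction: Optional[Point]  # None models the "no prediction" case


def is_hit(prediction: Optional[Point], target: Box) -> bool:
    """Assumed hit criterion: the predicted point lies inside the target box."""
    if prediction is None:
        return False
    x, y = prediction
    x_min, y_min, x_max, y_max = target
    return x_min <= x <= x_max and y_min <= y <= y_max


def evaluate_task(steps: List[Step]) -> Tuple[int, bool]:
    """Score steps in order and stop at the first incorrect prediction.

    Returns (number of correct steps before the first failure,
             whether every step of the task was correct).
    """
    correct = 0
    for step in steps:
        if not is_hit(step.prediction, step.target):
            # Strict protocol: later steps are never scored after a miss.
            return correct, False
        correct += 1
    return correct, True


# A three-step task where the second prediction misses its target.
task = [
    Step(target=(10, 10, 50, 30), prediction=(20, 20)),
    Step(target=(100, 100, 120, 110), prediction=(300, 300)),
    Step(target=(0, 0, 5, 5), prediction=(1, 1)),
]
print(evaluate_task(task))
```

In this example the third prediction would have been correct, but it is never counted because the second step fails. That is how early termination exposes error propagation, which per-step accuracy alone would hide.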
