Medical imaging benchmarks often evaluate vision-language models (VLMs) on pre-selected 2D images, slices, crops, or patches, which reduces evaluation to something close to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence such as key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across the evaluated models, answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench thus exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than merely plausible answers from selected images.
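To make the gap between answer-only and evidence-supported scoring concrete, the following is a minimal sketch of how a single episode might be scored. It is not the benchmark's actual protocol: the `Submission` and `Reference` fields, the bounding-box evidence format, exact-match answer comparison, and the IoU threshold of 0.5 are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical ROI format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class Submission:
    """What an agent returns for one episode (assumed fields)."""
    answer: str
    key_slice: int
    roi: Box


@dataclass
class Reference:
    """Withheld ground truth for one episode (assumed fields)."""
    answer: str
    lesion_slices: range  # slices on which the withheld mask is non-empty
    lesion_box: Box       # bounding box of the withheld mask


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_episode(sub: Submission, ref: Reference,
                  iou_threshold: float = 0.5) -> Dict[str, bool]:
    """Score one episode under both regimes.

    answer_only: the task answer matches the label.
    evidence_supported: the answer matches AND the submitted key slice and
    ROI agree with the withheld annotation (threshold is an assumption).
    """
    answer_ok = sub.answer.strip().lower() == ref.answer.strip().lower()
    evidence_ok = (
        sub.key_slice in ref.lesion_slices
        and box_iou(sub.roi, ref.lesion_box) >= iou_threshold
    )
    return {
        "answer_only": answer_ok,
        "evidence_supported": answer_ok and evidence_ok,
    }


# A correct answer that points at the wrong region passes answer-only
# scoring but fails once the evidence must also be correct.
reference = Reference("malignant", range(38, 45), (100, 100, 120, 120))
lucky_guess = Submission("Malignant", 40, (0, 0, 10, 10))
print(score_episode(lucky_guess, reference))
```

Under these assumptions, the evidence-supported score can never exceed the answer-only score, since it adds a second condition to the same episode.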