Real-world clinical practice demands multi-image comparative reasoning, yet current medical benchmarks remain limited to single-frame interpretation. We present MedFrameQA, the first benchmark explicitly designed to test multi-image medical visual question answering (VQA) through educationally validated diagnostic sequences. To construct this dataset, we develop a scalable pipeline that leverages narrative transcripts from medical education videos to align visual frames with textual concepts, automatically producing 2,851 high-quality multi-image VQA pairs with explicit, transcript-grounded reasoning chains. Our evaluation of 11 advanced multimodal large language models (MLLMs), including reasoning models, exposes severe deficiencies in multi-image synthesis: accuracies mostly fall below 50% and fluctuate unstably with the number of input images. Error analysis shows that models often treat images as isolated instances, failing to track pathological progression or cross-reference anatomical shifts. MedFrameQA provides a rigorous standard for evaluating the next generation of MLLMs on complex, temporally grounded medical narratives.
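To make the construction pipeline concrete, the sketch below illustrates one plausible reading of the frame-transcript alignment step: extracted keyframes are paired with transcript segments by timestamp overlap, and segments covering at least two frames become candidate multi-image VQA items grounded in the narrated text. All names here (TranscriptSegment, Frame, align_frames_to_segments, build_vqa_candidates) are illustrative assumptions, not the authors' released code, and a real pipeline would add further filtering (e.g., image-text similarity) and question generation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TranscriptSegment:
    start: float  # segment start time in the video, seconds
    end: float    # segment end time, seconds
    text: str     # narrated sentence(s) for this span


@dataclass
class Frame:
    timestamp: float  # time at which the keyframe was extracted, seconds
    path: str         # path to the extracted keyframe image


def align_frames_to_segments(
    frames: List[Frame],
    segments: List[TranscriptSegment],
) -> List[Tuple[TranscriptSegment, List[Frame]]]:
    """Pair each transcript segment with the keyframes whose timestamps
    fall inside its time span. Uses a simple overlap criterion, which is
    an assumption; the paper's pipeline may align more selectively."""
    aligned = []
    for seg in segments:
        hits = [f for f in frames if seg.start <= f.timestamp < seg.end]
        if hits:
            aligned.append((seg, hits))
    return aligned


def build_vqa_candidates(
    aligned: List[Tuple[TranscriptSegment, List[Frame]]],
    min_images: int = 2,
) -> List[Dict]:
    """Keep only segments covering at least `min_images` frames, so each
    candidate item genuinely requires reasoning across multiple images.
    The transcript text is retained as grounding for the reasoning chain."""
    return [
        {
            "images": [f.path for f in hits],
            "evidence": seg.text,
        }
        for seg, hits in aligned
        if len(hits) >= min_images
    ]
```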