Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque reasoning process gives doctors no reliable basis for judging a model's output, which makes it difficult for the model to assist in diagnosis. To address this gap, we introduce M3CoTBench, a new benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 tasks of varying difficulty, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, reveals current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page: https://juntaojianggavin.github.io/projects/M3CoTBench/.