Radiological diagnosis is a perceptual process in which careful visual inspection and language-based reasoning are repeatedly interleaved. Most medical large vision-language models (LVLMs), however, perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details such as texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved vision-language reasoning in chest X-ray interpretation. MMRad-IVL-22K mirrors the radiologist's workflow of repeated cycles of visual inspection and reasoning, in which visual rationales complement textual descriptions and ground each step of the reasoning process. The dataset comprises 21,994 diagnostic traces that systematically scan across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms generation guided by text-only CoT in both clinical accuracy and report quality (e.g., a 6\% increase in the RadGraph metric), confirming that high-fidelity interleaved vision-language evidence is an indispensable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs shows that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.
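To make the notion of an interleaved diagnostic trace concrete, the following is a minimal, hypothetical Python sketch of how such a record might be represented in code. All class names, field names, paths, and values here are illustrative assumptions and do not reflect the dataset's actual schema or file layout.

```python
# A minimal, hypothetical sketch of one interleaved diagnostic trace.
# Field names and structure are illustrative assumptions, not the
# actual MMRad-IVL-22K schema.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReasoningStep:
    region: str                      # e.g., "right lower lobe" (one of the anatomical regions)
    text_rationale: str              # textual finding for this step
    visual_rationale: Optional[str]  # path to an image crop grounding the step
    bbox: Optional[Tuple[int, int, int, int]] = None  # optional pixel coordinates


@dataclass
class DiagnosticTrace:
    image_path: str                  # full chest X-ray
    steps: List[ReasoningStep]       # interleaved visual inspection + reasoning
    final_report: str                # report produced after the interleaved reasoning


# Example instance (all values are placeholders):
trace = DiagnosticTrace(
    image_path="cxr_0001.png",
    steps=[
        ReasoningStep(
            region="right lower lobe",
            text_rationale="Patchy opacity with increased density, consistent with consolidation.",
            visual_rationale="cxr_0001_rll_crop.png",
            bbox=(412, 680, 760, 940),
        ),
    ],
    final_report="Findings suggestive of right lower lobe pneumonia.",
)
```

The key design point this sketch tries to convey is that each reasoning step carries an actual visual rationale (an image crop) alongside its text, rather than only textual coordinates, so texture and density information remains available at every step.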