Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation

Recent medical vision-language models (VLMs) have shown promise in 2D medical image interpretation. However, extending them to 3D medical imaging has been challenging due to computational complexity and data scarcity. Although a few VLMs designed specifically for 3D medical imaging have recently emerged, all of them learn the volumetric representation of a 3D medical image as a set of sub-volumetric features. This process introduces overly correlated representations along the z-axis that neglect slice-specific clinical details, particularly in 3D medical images whose adjacent slices have low redundancy. To address this limitation, we introduce MS-VLM, which mimics radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. Likewise, MS-VLM leverages self-supervised 2D transformer encoders to learn a volumetric representation that captures inter-slice dependencies from a sequence of slice-specific features. Unbound by sub-volumetric patchification, MS-VLM can obtain useful volumetric representations from 3D medical images with any slice length and from multiple images acquired in different planes and phases. We evaluate MS-VLM on the publicly available chest CT dataset CT-RATE and on an in-house rectal MRI dataset. In both scenarios, MS-VLM surpasses existing methods in radiology report generation, producing more coherent and clinically relevant reports. These findings highlight the potential of MS-VLM to advance 3D medical image interpretation and improve the robustness of medical VLMs.
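
The slice-wise idea described above can be illustrated with a toy sketch. This is not the MS-VLM implementation. It only assumes the general pattern the abstract states: encode each slice independently, then let the slice features attend to one another. The function `encode_slice` is a hypothetical stand-in for a self-supervised 2D transformer encoder (here it merely computes four intensity statistics), and `inter_slice_attention` is plain single-head scaled dot-product self-attention with no learned projections.

```python
import math


def encode_slice(slice_2d):
    """Hypothetical stand-in for a 2D slice encoder.

    Returns a 4-dimensional feature: [mean, std, min, max] of pixel values.
    A real system would use a learned 2D encoder here (assumption).
    """
    pixels = [p for row in slice_2d for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, math.sqrt(var), min(pixels), max(pixels)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_slice_attention(tokens):
    """Single-head self-attention across slices (identity projections).

    Every output token is a weighted mix of all slice features, so it
    carries context from the other slices in the volume.
    """
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return mixed


def encode_volume(volume):
    """Encode a volume (list of 2D slices) as one context-aware token per slice."""
    slice_tokens = [encode_slice(s) for s in volume]
    return inter_slice_attention(slice_tokens)


def make_volume(num_slices, size=4):
    """Build a synthetic volume with simple intensity ramps (illustration only)."""
    return [
        [[(z + r + c) / 10.0 for c in range(size)] for r in range(size)]
        for z in range(num_slices)
    ]


if __name__ == "__main__":
    for n in (3, 7):
        tokens = encode_volume(make_volume(n))
        print(n, "slices ->", len(tokens), "tokens of dim", len(tokens[0]))
```

Because nothing in this sketch fixes the number of slices, the same functions accept a 3-slice and a 7-slice volume and return one token per slice. That is the property the abstract attributes to avoiding sub-volumetric patchification, shown here only in toy form.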
