Merlin: A Vision Language Foundation Model for 3D Computed Tomography

Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, Christian Bluethgen, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Andrew Johnston, Robert D. Boutin, Andrew Wentland, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, Akshay S. Chaudhari

From arXiv, 18 pages, 7 figures

Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is strong impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin, a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while the model-adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically relevant evaluations, we assess the efficacy of various network architectures and training strategies to show that Merlin performs favorably relative to existing task-specific baselines. We derive data scaling laws to empirically assess how much training data is needed to reach a required level of downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU.
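The per-task counts in the abstract can be reconciled with the stated totals of 6 task types and 752 individual tasks. The short tally below is only a sketch of that arithmetic: the counts for findings, phenotypes, diseases, and organs come from the abstract, while counting retrieval as two tasks (image to findings, image to impressions) and report generation as one task is an assumption made here, and the names `TASKS` and `total_tasks` are illustrative.

```python
# Task counts taken from the abstract. Two entries are assumptions, not
# stated explicitly in the source: retrieval counted as 2 tasks and
# report generation counted as 1 task.
TASKS = {
    "zero-shot findings classification": 31,
    "phenotype classification": 692,
    "zero-shot cross-modal retrieval": 2,   # assumed: image-to-findings, image-to-impressions
    "5-year disease prediction": 6,
    "radiology report generation": 1,       # assumed: a single task
    "3D semantic segmentation": 20,
}


def total_tasks(tasks):
    """Sum the individual task counts across all task types."""
    return sum(tasks.values())


if __name__ == "__main__":
    print(len(TASKS), "task types,", total_tasks(TASKS), "individual tasks")
```

Under these assumptions the tally matches the abstract's figures; a different way of counting retrieval or report generation would not.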
