The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the radiologist's workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We call our framework Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced accuracy on the proposed publicly available brain MRI benchmark Pub-Brain-5, and +8.3% and +1.7% macro AUC on the head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip.
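The slice → scan → study hierarchy described above can be illustrated with a minimal sketch: attention is applied among patch tokens within each slice, then among slice embeddings within each scan, then among scan embeddings within the study, pooling at each level. This is a simplified, single-head NumPy illustration under assumed tensor shapes; the actual HLIP architecture (multi-head attention, learned projections, and its exact pooling scheme) is defined in the linked repository, and the function names here are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over the last two axes:
    # (..., n, d) -> (..., n, d).
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def hierarchical_encode(study):
    """Encode one study of shape (n_scans, n_slices, n_tokens, dim)
    into a single (dim,) study embedding via slice/scan/study attention."""
    # Level 1 (slice): attend among patch tokens within each slice,
    # then mean-pool tokens into one embedding per slice.
    x = attention(study, study, study).mean(axis=2)   # (n_scans, n_slices, dim)
    # Level 2 (scan): attend among slice embeddings within each scan,
    # then mean-pool slices into one embedding per scan.
    x = attention(x, x, x).mean(axis=1)               # (n_scans, dim)
    # Level 3 (study): attend among scan embeddings across the study,
    # then mean-pool scans into the study embedding.
    x = attention(x[None], x[None], x[None])[0].mean(axis=0)  # (dim,)
    return x
```

Restricting each attention level to its own granularity is what keeps the cost manageable for uncurated studies, which may contain many scans of varying slice counts, instead of flattening everything into one long token sequence.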