Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks

There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process that is inaccessible to many research groups. We argue that foundation models should instead prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes than as vision encoders for vision-language models. Model weights and training code are publicly available.
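The evaluation protocol described above, frozen backbone features with a lightweight probe, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a hypothetical stand-in (a fixed random projection, not VoxelFM or its API), the volumes and labels are synthetic, and the probe is a plain logistic regression trained by gradient descent. Only the probe parameters are updated; the backbone parameters are never touched.

```python
import numpy as np


def frozen_backbone(volumes, projection):
    """Hypothetical stand-in for a frozen 3D encoder (NOT VoxelFM).

    Flattens each volume and applies a fixed projection that is never updated.
    """
    flat = volumes.reshape(volumes.shape[0], -1)
    return np.tanh(flat @ projection)


def train_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe on precomputed features."""
    n_samples, n_dims = features.shape
    w = np.zeros(n_dims)
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (features.T @ grad) / n_samples
        b -= lr * grad.mean()
    return w, b


def predict(features, w, b):
    """Binary prediction from the probe's logits."""
    return (features @ w + b > 0).astype(int)


# Synthetic data (assumption): class-1 volumes carry a uniform intensity shift.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
volumes = rng.normal(size=(200, 4, 4, 4)) + labels[:, None, None, None] * 1.0

# Fixed "backbone" parameters; a copy is kept to show they stay unchanged.
projection = rng.normal(size=(64, 16)) / 8.0
projection_before = projection.copy()

# Features are extracted once, then only the probe is trained on them.
features = frozen_backbone(volumes, projection)
w, b = train_linear_probe(features, labels)
accuracy = (predict(features, w, b) == labels).mean()
print(f"probe training accuracy: {accuracy:.2f}")
```

Because the features are computed once and cached, the cost of adapting to a new task in this setup is that of fitting the probe alone, which is the efficiency argument the abstract makes for frozen-backbone transfer.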
