TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

Foundation models in radiology are expected to support a wide range of clinical tasks, but the computational cost of training on volumetric 3D-CT data remains a major obstacle. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, using a large-scale dataset of 140,000 series. Segmentation techniques and Large Language Model (LLM)-based processing of radiology reports automate the creation of pairs of organ volumes and finding sentences. Training combines self-supervised pre-training via VideoMAE with contrastive learning on these volume-text pairs, with the aim of balancing computational efficiency and representation capability. In zero-shot organ-wise lesion classification, the proposed model achieved higher F1 scores than CT-CLIP in 83% (5/6) of organs and higher F1 scores than Merlin in 64% (9/14) of organs. These results suggest that the proposed model generalizes well in a clinical evaluation setting that uses actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification, our model achieved a higher AUROC than Merlin in 83% (25/30) of finding categories. We also confirmed performance comparable to existing Vision-Language Models (VLMs) on radiology report generation. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
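
The contrastive learning step and the zero-shot classification setting described above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so this sketch assumes a CLIP-style symmetric cross-entropy over cosine similarities between organ-volume embeddings and finding-sentence embeddings. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all hypothetical and are not taken from TotalFM.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(volume_embs, text_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    vols = [l2_normalize(v) for v in volume_embs]
    txts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) / temperature for t in txts]
            for v in vols]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(volume_embs, text_embs, temperature=0.07):
    """Average of the volume-to-text and text-to-volume cross-entropies."""
    logits = cosine_logits(volume_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))


def zero_shot_predict(volume_emb, prompt_embs):
    """Return the index of the text prompt most similar to the volume."""
    scores = cosine_logits([volume_emb], prompt_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings: pair i of volumes and sentences points in the same direction.
volumes = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[2.0, 0.0], [0.0, 3.0]]
matched_loss = symmetric_contrastive_loss(volumes, sentences)
mismatched_loss = symmetric_contrastive_loss(volumes, sentences[::-1])

# Zero-shot use: choose between two hypothetical prompts for one organ volume.
predicted = zero_shot_predict([0.9, 0.1], sentences)
```

In this toy setup the loss is small when each volume is paired with its own sentence and large when the pairs are swapped, which is the signal that pulls matching volume-text pairs together during training. Zero-shot classification then reduces to picking the prompt whose embedding is closest to the volume embedding.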