Foundation models, typically trained with self-supervised learning on large-scale and diverse datasets, have recently shown great potential in medical image analysis. However, because medical imaging data is highly heterogeneous in its spatial properties, current models must tailor their structures to each dataset, which makes it difficult to leverage the abundant unlabeled data. In this work, we propose a universal foundation model for medical image analysis that processes images with heterogeneous spatial properties using a unified structure. To build it, we introduce spatially adaptive networks (SPAD-Nets), a family of networks that dynamically adjust their structures to the spatial properties of the input images. We pre-train a spatially adaptive visual tokenizer (SPAD-VT) and then a spatially adaptive Vision Transformer (SPAD-ViT) via masked image modeling (MIM) on 55 public medical image datasets. The pre-training data comprises over 9 million image slices and is, to our knowledge, the largest, most comprehensive, and most diverse dataset used to pre-train universal foundation models for medical image analysis. Experimental results on downstream medical image classification and segmentation tasks demonstrate the superior performance and label efficiency of our model. Our code is available at https://github.com/function2-llx/PUMIT.
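To make the idea of adapting network structure to spatial properties concrete, the following is a minimal sketch and not the SPAD-Nets implementation. It assumes one plausible rule: at each stage, an axis is downsampled only if its voxel spacing is not already much coarser than the finest axis, so thick-slice volumes and 2D images keep their depth resolution while in-plane resolution is reduced. The function name `adaptive_strides`, its parameters, and the rule itself are hypothetical and chosen for illustration only.

```python
def adaptive_strides(spacing, num_stages, base_stride=2):
    """Choose a per-axis downsampling stride for each network stage.

    spacing: (depth, height, width) voxel spacing, e.g. in millimetres.
        A 2D image can be passed as a one-slice volume with a very large
        depth spacing (an assumption of this sketch).
    Hypothetical rule: an axis is downsampled by `base_stride` only while
    its current spacing is less than `base_stride` times the finest spacing;
    otherwise it keeps stride 1 for that stage.
    """
    spacing = list(spacing)
    strides = []
    for _ in range(num_stages):
        finest = min(spacing)
        stage = tuple(
            base_stride if s < finest * base_stride else 1
            for s in spacing
        )
        strides.append(stage)
        # Downsampling an axis by k makes its effective spacing k times coarser.
        spacing = [s * k for s, k in zip(spacing, stage)]
    return strides


# Anisotropic CT-like volume: 5 mm slices, 1 mm in-plane.
print(adaptive_strides((5.0, 1.0, 1.0), num_stages=3))
# → [(1, 2, 2), (1, 2, 2), (2, 2, 2)]
```

Under this rule the depth axis is left untouched until the in-plane spacing has caught up, so the same stack of stages can serve isotropic volumes, thick-slice volumes, and 2D images with different effective strides.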