Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating strong generalization across a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision and imaging tasks. Pretraining strategies such as masked image reconstruction or contrastive learning shape representations suited to recovering generic visual patterns or capturing global semantic structures, which may not align with the task-specific requirements of downstream applications such as segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical setting, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how task alignment influences transfer performance from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as the maximum mean discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.