Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model development for this task. However, the lack of a standardized benchmark complicates such comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under multiple settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, a more advanced decoder, and a smaller patch size, while reducing training time more than 13-fold. Using multiple datasets for training and evaluation is also recommended, as the performance ranking varies across datasets and domain shifts. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.
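To make the recommended setting concrete, the sketch below illustrates what a "linear decoder" on ViT-B/16 patch features amounts to: each 768-dimensional patch token is mapped by a single linear layer to per-class logits, which are then upsampled to pixel resolution. This is a minimal numpy illustration under assumed shapes (ViT-B token width 768, a 512x512 input, 19 classes); it is not the paper's implementation, and all names are illustrative.

```python
import numpy as np

def linear_decode(patch_tokens, weight, bias, patch=16):
    """Map a grid of patch features to a full-resolution logit map.

    patch_tokens: (h, w, d) grid of backbone patch features
    weight:       (d, num_classes) linear decoder weights
    bias:         (num_classes,) linear decoder bias
    """
    logits = patch_tokens @ weight + bias  # (h, w, num_classes) per-patch logits
    # Upsample each patch's logits to patch x patch pixels
    # (nearest-neighbor; bilinear is common in practice).
    return np.repeat(np.repeat(logits, patch, axis=0), patch, axis=1)

rng = np.random.default_rng(0)
h, w, d, c = 32, 32, 768, 19  # 512x512 input at 16x16 patches; 19 classes assumed
feats = rng.standard_normal((h, w, d))
W = rng.standard_normal((d, c)) * 0.01
b = np.zeros(c)
seg = linear_decode(feats, W, b).argmax(-1)  # (512, 512) predicted label map
print(seg.shape)
```

In end-to-end fine-tuning, gradients flow through both the decoder parameters and the backbone; linear probing would instead freeze the backbone and train only `W` and `b`, which is what the paper finds unrepresentative.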