How to Benchmark Vision Foundation Models for Semantic Segmentation?

Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform semantic segmentation effectively. Benchmarking their performance is essential for selecting among current models and guiding future model development for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, several VFMs are fine-tuned under a range of settings, and the impact of each individual setting on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, a more advanced decoder, and a smaller patch size, while reducing training time by a factor of more than 13. Using multiple datasets for training and evaluation is also recommended, as the performance ranking varies across datasets and domain shifts. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.
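
The question of whether a cheap setting is "representative" of a costly one comes down to whether both settings rank the models in the same order. One way to quantify this, sketched below, is a rank correlation such as Kendall's tau; this text does not state which measure is used, so the choice of tau is an assumption made for illustration. The function name, the model names, and the mIoU values are all hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two score dicts keyed by model name.

    Returns a value in [-1, 1]: 1 means both settings rank the models
    identically, -1 means the ranking is fully reversed.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models to compare rankings")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # The sign of the product tells whether both settings order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties (product == 0) count as neither concordant nor discordant.

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mIoU values for placeholder models; NOT results from the paper.
cheap_setting = {"model_a": 50.0, "model_b": 47.5, "model_c": 44.0, "model_d": 41.0}
costly_setting = {"model_a": 55.0, "model_b": 53.0, "model_c": 48.0, "model_d": 49.0}

tau = kendall_tau(cheap_setting, costly_setting)
print(f"Kendall tau between settings: {tau:.3f}")
```

A tau close to 1 would suggest that the cheaper setting preserves the ranking obtained with the costlier one, while a low or negative tau (as one might expect when comparing linear probing with end-to-end fine-tuning, per the claim above) would suggest it does not.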
