Growth in computational resources and data availability has led to the emergence of large deep learning models trained on copious amounts of data using self-supervised or semi-supervised learning. These "foundation" models are often adapted to a variety of downstream tasks, such as classification, object detection, and segmentation, with little-to-no training on the target dataset. In this work, we perform a robustness analysis of Visual Foundation Models (VFMs) for segmentation tasks and compare them to smaller-scale supervised models. We focus on robustness against real-world distribution shift perturbations. We benchmark four state-of-the-art segmentation architectures on two datasets, COCO and ADE20K, under 17 different perturbations with 5 severity levels each. Our findings include: (1) VFMs are not robust to compression-based corruptions; (2) while the selected VFMs do not significantly outperform non-VFM models or exhibit greater robustness, they remain competitively robust in zero-shot evaluations, particularly when the non-VFM models are trained under supervision; and (3) the selected VFMs demonstrate greater resilience for specific categories of objects, likely due to their open-vocabulary training paradigm, a feature that non-VFM models typically lack. We posit that the suggested robustness evaluation introduces new requirements for foundation models, thus sparking further research to enhance their performance.
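The benchmark layout described in the abstract (17 perturbations, each at 5 severity levels) can be illustrated with a minimal sketch. Only the counts 17 and 5 come from the abstract. The perturbation names, the helper functions, and the "relative robustness" ratio (mean corrupted score divided by clean score) are assumptions made for illustration and may differ from the metrics actually used in the paper.

```python
from itertools import product
from statistics import mean

NUM_PERTURBATIONS = 17             # from the abstract
SEVERITY_LEVELS = (1, 2, 3, 4, 5)  # from the abstract: 5 severity levels each


def iou(pred, target):
    """Intersection-over-union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    return inter / union if union else 1.0


def evaluation_grid(perturbations, severities=SEVERITY_LEVELS):
    """All (perturbation, severity) pairs a model is evaluated on for one dataset."""
    return list(product(perturbations, severities))


def relative_robustness(clean_score, corrupted_scores):
    """Mean corrupted score divided by the clean score (assumed metric)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return mean(corrupted_scores) / clean_score


# Hypothetical placeholder names; the abstract does not list the perturbations.
perturbations = [f"perturbation_{i}" for i in range(NUM_PERTURBATIONS)]
grid = evaluation_grid(perturbations)
print(len(grid))  # number of corrupted variants per dataset
```

Under this layout, each model sees 17 × 5 = 85 corrupted variants of a dataset in addition to the clean one. A ratio close to 1 would mean a model retains most of its clean performance under corruption.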