This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbones, including ResNets and pretrained Vision Transformer (ViT)-based models, and decoders, including a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, substantially outperforming the best ResNet-based configuration by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models for the GFSS task and expect further improvements on testing benchmarks. A caveat, however, is that the model overfits easily when a pure ViT-based backbone is paired with a large-scale ViT decoder.
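As a rough sketch of the headline configuration, the snippet below pairs a frozen DINOv2 backbone with a per-patch linear classifier for segmentation. The hub entry point and feature keys follow the public DINOv2 repository; the class count, input size, and module names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DinoLinearSeg(nn.Module):
    """Sketch: frozen DINOv2 backbone + linear classifier over patch tokens."""

    def __init__(self, num_classes=21, backbone_name="dinov2_vitb14"):
        super().__init__()
        # Frozen pretrained DINOv2 backbone (ViT-B/14 from the official hub).
        self.backbone = torch.hub.load("facebookresearch/dinov2", backbone_name)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Linear classifier mapping each patch feature to class logits.
        self.classifier = nn.Linear(self.backbone.embed_dim, num_classes)

    def forward(self, x):
        b, _, h, w = x.shape
        # Per-patch tokens, shape (B, N, C); DINOv2 uses a 14-pixel patch size.
        feats = self.backbone.forward_features(x)["x_norm_patchtokens"]
        logits = self.classifier(feats)                    # (B, N, num_classes)
        gh, gw = h // 14, w // 14
        logits = logits.transpose(1, 2).reshape(b, -1, gh, gw)
        # Upsample patch-level logits back to the input resolution.
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)

# Example: a 518x518 input (divisible by the 14-pixel patch size).
model = DinoLinearSeg(num_classes=21)
out = model(torch.randn(1, 3, 518, 518))   # (1, 21, 518, 518)
```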