We introduce SSR, which uses SAM (Segment Anything) as a strong regularizer during training to make the image encoder substantially more robust across domains. Because SAM is pre-trained on a large number of internet images covering a wide variety of domains, the features it extracts depend less on any specific domain than those of a conventional ImageNet pre-trained image encoder. At the same time, the ImageNet pre-trained image encoder remains a mature choice of backbone for semantic segmentation, particularly because SAM is category-agnostic. SSR therefore adopts a simple yet highly effective design: it uses the ImageNet pre-trained image encoder as the backbone and regularizes the intermediate feature of each stage (i.e., the 4 stages of MiT-B5) with SAM during training. In extensive experiments on GTA5$\rightarrow$Cityscapes, SSR significantly improves performance over the baseline without introducing any extra inference overhead.
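The training objective described above can be sketched as a segmentation loss plus a per-stage penalty that pulls backbone features toward SAM features. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the text does not specify the form of the regularizer, so the mean squared distance, the `reg_weight` value, the function names, and the assumption that SAM features have already been aligned to each stage's size are all hypothetical. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
from typing import Sequence

Feature = Sequence[float]  # flattened feature map; stands in for a tensor


def stage_distance(backbone_feat: Feature, sam_feat: Feature) -> float:
    """Mean squared difference between two equally sized feature vectors.

    MSE is an assumed choice; the text does not state the regularizer's form.
    """
    if len(backbone_feat) != len(sam_feat):
        raise ValueError("features must be aligned to the same size first")
    return sum((b - s) ** 2 for b, s in zip(backbone_feat, sam_feat)) / len(sam_feat)


def ssr_training_loss(
    seg_loss: float,
    backbone_feats: Sequence[Feature],  # one entry per stage (4 for MiT-B5)
    sam_feats: Sequence[Feature],       # SAM features, assumed aligned per stage
    reg_weight: float = 0.1,            # hypothetical weight, not from the text
) -> float:
    """Segmentation loss plus a SAM-based penalty summed over backbone stages."""
    if len(backbone_feats) != len(sam_feats):
        raise ValueError("need one SAM target per backbone stage")
    reg = sum(stage_distance(b, s) for b, s in zip(backbone_feats, sam_feats))
    return seg_loss + reg_weight * reg
```

Because the penalty appears only in the training loss, a model trained this way would run just the backbone and segmentation head at test time, which is consistent with the claim of no extra inference overhead.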