In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Motivated by the idea of Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce Rein, a robust fine-tuning approach that harnesses VFMs for DGSS in a parameter-efficient way. Built upon a set of trainable tokens, each linked to distinct instances, Rein refines the feature maps produced by each layer of the backbone and forwards them to the next layer. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks and, surprisingly, surpasses full-parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves an mIoU of 68.1% on Cityscapes without accessing any real urban-scene datasets.
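The layer-wise refinement described above can be illustrated with a minimal sketch. The abstract does not specify the exact operations, so everything below is an assumption made for illustration: the function names `refine` and `forward`, the dot-product similarity with softmax weighting, and the residual update are hypothetical and should not be read as the authors' implementation. The sketch only shows the general pattern of keeping the backbone layers frozen while a small set of per-layer tokens adjusts each feature map before it reaches the next layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def refine(features, tokens):
    """Refine an N x C feature map with M x C tokens (assumed scheme).

    Each patch feature is compared with every token; the resulting
    similarity weights mix the tokens into a per-patch delta that is
    added back to the feature as a residual.
    """
    dim = len(tokens[0])
    refined = []
    for f in features:
        # Scaled dot-product similarity between this patch and each token.
        scores = [sum(a * b for a, b in zip(f, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        # Patch-specific delta: similarity-weighted sum of the tokens.
        delta = [sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(dim)]
        refined.append([a + d for a, d in zip(f, delta)])
    return refined


def forward(features, frozen_layers, tokens_per_layer):
    """Run frozen layers, refining each layer's output before the next layer."""
    x = features
    for layer, tokens in zip(frozen_layers, tokens_per_layer):
        x = layer(x)            # frozen backbone layer (parameters not trained)
        x = refine(x, tokens)   # only the tokens would be trainable
    return x
```

Because the delta depends on how similar each patch is to each token, patches belonging to different content receive different adjustments, which is one plausible reading of the "diverse refinements" the abstract mentions. In this sketch only the tokens would carry gradients, so the trainable parameter count stays small relative to the backbone.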